Abstract

Over repeat presentations of the same stimulus, sensory neurons show variable responses. This "noise" is typically 
correlated between pairs of cells, and a question with rich history in neuroscience is how these noise correlations impact the 
population's ability to encode the stimulus. Here, we consider a very general setting for population coding, investigating 
how information varies as a function of noise correlations, with all other aspects of the problem - neural tuning curves, etc. 
- held fixed. This work yields unifying insights into the role of noise correlations. These are summarized in the form of 
theorems, and illustrated with numerical examples involving neurons with diverse tuning curves. Our main contributions 
are as follows. (1) We generalize previous results to prove a sign rule (SR) — if noise correlations between pairs of neurons 
have opposite signs vs. their signal correlations, then coding performance will improve compared to the independent case. 
This holds for three different metrics of coding performance, and for arbitrary tuning curves and levels of heterogeneity. 
This generality is true for our other results as well. (2) As also pointed out in the literature, the SR does not provide a 
necessary condition for good coding. We show that a diverse set of correlation structures can improve coding. Many of 
these violate the SR, as do experimentally observed correlations. There is structure to this diversity: we prove that the 
optimal correlation structures must lie on boundaries of the possible set of noise correlations. (3) We provide a novel set of 
necessary and sufficient conditions, under which the coding performance (in the presence of noise) will be as good as it 
would be if there were no noise present at all. 
Introduction

Neural populations typically show correlated variability over 
repeat presentation of the same stimulus [1-4]. These are called 
noise correlations, to differentiate them from correlations that arise 
when neurons respond to similar features of a stimulus. Such signal 
correlations are measured by observing how pairs of mean (averaged 
over trials) neural responses co-vary as the stimulus is changed 
[3,5]. 

How do noise correlations affect the population's ability to 
encode information? This question is well-studied [3,5-16], and 
prior work indicates that the presence of noise correlations can 
either improve stimulus coding, diminish it, or have little effect 
(Fig. 1). Which case occurs depends richly on details of the signal 
and noise correlations, as well as the specific assumptions made. 
For example [8,9,14] show that a classical picture — wherein 
positive noise correlations prevent information from increasing 
linearly with population size — does not generalize to heteroge- 
neously tuned populations. Similar results were obtained by [17], 
and these examples emphasize the need for general insights. 

Thus, we study a more general mathematical model, and 
investigate how coding performance changes as the noise 
correlations are varied. Figure 1, modified and extended from
[5], illustrates this process. In this figure, the only aspect of the 



population responses that differs from case to case is the set of noise
correlations, resulting in differently shaped distributions. These
different noise structures lead to different levels of stimulus 
discriminability, and hence coding performance. The different 
cases illustrate our approach: given any set of tuning curves and 
noise variances, we study how encoded stimulus information varies 
with respect to the set of all pairwise noise correlations. 

Compared to previous work in this area, there are two key 
differences that make our analysis novel: we make no particular
assumptions on the structure of the tuning curves; and we do not 
restrict ourselves to any particular correlation structure such as the 
"limited-range" correlations often used in prior work [5,7,8]. Our 
results still apply to the previously-studied cases, but also hold 
much more generally. This approach leads us to derive 
mathematical theorems relating encoded stimulus information to 
the set of pairwise noise correlations. We prove the same theorems 
for several common measures of coding performance: the linear 
Fisher information, the precision of the optimal linear estimator 
(OLE [18]), and the mutual information between Gaussian stimuli 
and responses. 

First, we prove that coding performance is always enhanced - 
relative to the case of independent noise - when the noise and 
signal correlations have opposite signs for all cell pairs (see Fig. 1). 
This "sign rule" (SR) generalizes prior work. Importantly, the 
Author Summary

Sensory systems communicate information to the brain — 
and brain areas communicate between themselves — via 
the electrical activities of their respective neurons. These 
activities are "noisy": repeat presentations of the same 
stimulus do not yield identical responses every time.
Furthermore, the neurons' responses are not independent: 
the variability in their responses is typically correlated from 
cell to cell. How does this change the impact of the noise 
— for better or for worse? Our goal here is to classify 
(broadly) the sorts of noise correlations that are either 
good or bad for enabling populations of neurons to 
transmit information. This is helpful as there are many 
possibilities for the noise correlations, and the set of 
possibilities becomes large for even modestly sized neural 
populations. We prove mathematically that, for larger 
populations, there are many highly diverse ways that 
favorable correlations can occur. These often differ from 
the noise correlation structures that are typically identified 
as beneficial for information transmission - those that 
follow the so-called "sign rule." Our results help in 
interpreting some recent data that seems puzzling from 
the perspective of this rule. 



converse is not true: noise correlations that perfectly violate the SR
- and thus have the same signs as the signal correlations - can yield
better coding performance than does independent noise. Thus, as 
previously observed [8,9,14], the SR does not provide a necessary 
condition for correlations to enhance coding performance. 

Since experimentally observed noise correlations often have the 
same signs as the signal correlations [3,6,19], new theoretical 
insights are needed. To that effect, we develop a new organizing 
principle: optimal coding will always be obtained on the boundary 
of the set of allowed correlation coefficients. As we discuss, this 
boundary can be defined in flexible ways that incorporate 
constraints from statistics or biological mechanisms. 

Finally, we identify conditions under which appropriately 
chosen noise correlations can yield coding performance as good 
as would be obtained with deterministic neural responses. For 
large populations, these conditions are satisfied with high 
probability, and the set of such correlation matrices is very high- 
dimensional. Many of them also strongly violate the SR. 

Results 

The layout of our Results section is as follows. We will begin by 
describing our setup, and the quantities we will be computing, in 
Section "Problem setup". 

In Section "The sign rule revisited", we will then discuss our 
generalized version of the "sign rule", Theorem 1, namely that 
signal and noise correlations between pairs of neurons with 
opposite signs will always improve encoded information compared 
with the independent case. Next, in Section "Optimal correlations 
lie on boundaries", we use the fact that all of our information 
quantities are convex functions of the noise correlation coefficients 
to conclude that the optimal noise correlation structure must lie on 
the boundary of the allowed set of correlation matrices, Theorem 
2. 

We will further observe that there will typically be a large set of 
correlation matrices that all yield optimal (or near-optimal) coding 
performance, in a numerical example of heterogeneously tuned 
neural populations in Section "Heterogeneously tuned neural 
populations". 



We prove that these observations are general in Section "Noise 
cancellation" by studying the noise canceling correlations (those 
that yield the same high coding fidelity as would be obtained in the 
absence of noise). We will provide a set of necessary and sufficient 
conditions for correlations to be "noise canceling", Theorem 3, 
and for a system to allow for these noise canceling correlations, 
Theorem 4. Finally, we will prove a result that suggests that, in 
large neural populations with randomly chosen stimulus response 
characteristics, these conditions are likely to be satisfied, Theorem 
5. 

A summary of most frequent notations we use is listed in 
Table 1. 

Problem setup 

We will consider populations of neurons that generate noisy 
responses x in response to a stimulus s. The responses, x - wherein 
each component $x_i$ represents one cell's response - can be
considered to be continuous-valued firing rates, discrete spike 
counts, or binary "words", wherein each neuron's response is a 1 
("spike") or 0 ("not spike"). The only exception is that, when we 
consider $I_{\mathrm{mut},G}$ (discussed below), the responses must be continuous-valued. We consider arbitrary tuning for the neurons;
$\mu_i = E\{x_i \mid s\}$. For scalar stimuli, this definition of "tuning"
corresponds to the notion of a tuning curve. In the case of more 
complex stimuli, it is similar to the typical notion of a receptive 
field. Recall that the signal correlations are determined by the co- 
variation of the mean responses of pairs of neurons as the stimulus 
is varied, and thus they are determined by the similarity in the 
tuning functions. 

As for the structure of noise across the population, our analysis 
allows for the general case in which the noise covariance matrix
$C^n_{ij} = \mathrm{cov}(x_i, x_j \mid s)$ (superscript n denotes "noise") depends on the
stimulus $s$. This generality is particularly interesting given the
observations of Poisson-like variability [20,21] in neural systems, 
and that correlations can vary with stimuli [3,16,19,22]. We will 
assume that the diagonal entries of the conditional covariance 
matrix - which describe each cell's variance - will be fixed, and
then ask how coding performance changes as we vary the off-diagonal
entries, which describe the covariance between the cells'
responses (recall that the noise correlations are the pairwise
covariances, divided by the geometric mean of the relevant
variances, $\rho_{ij} = C^n_{ij} / \sqrt{C^n_{ii} C^n_{jj}}$).
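
As a minimal illustration of this normalization (a sketch with made-up numbers, not values from the paper), the following Python snippet converts a noise covariance matrix into noise correlation coefficients. The function name `noise_correlations` and the example matrix are assumptions introduced here for illustration.

```python
import math

def noise_correlations(cov):
    """Return rho[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j])."""
    n = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(n)]
            for i in range(n)]

# Hypothetical 2-neuron noise covariance: variances 4 and 1, covariance 1.
cov = [[4.0, 1.0],
       [1.0, 1.0]]
rho = noise_correlations(cov)
print(rho[0][1])  # 1 / sqrt(4 * 1) = 0.5
```

Holding the diagonal of `cov` fixed while changing its off-diagonal entries is therefore the same as varying the correlation coefficients directly.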

We quantify the coding performance with the following 
measures, which are defined more precisely in the Methods
Section "Defining the information quantities, signal and noise
correlations", below. First, we consider the linear Fisher information
($I_{F,\mathrm{lin}}$, Eq. (5)), which measures how easy it is to separate the
response distributions that result from two similar stimuli, with a
linear discriminant. This is equivalent to the quantity used by [11]
and [10] (where Fisher information reduces to $I_{F,\mathrm{lin}}$). While Fisher
information is a measure of local coding performance, we are also 
interested in global measures. 

We will consider two such global measures, the OLE 
information $I_{\mathrm{OLE}}$ (Eq. (12)) and mutual information for Gaussian
stimuli and responses $I_{\mathrm{mut},G}$ (Eq. (13)). $I_{\mathrm{OLE}}$ quantifies how well
the optimal linear estimator (OLE) can recover the stimulus from
the neural responses: large $I_{\mathrm{OLE}}$ corresponds to small mean
squared error of OLE and vice versa. For the OLE, there is one set 
of read-out weights used to estimate the stimulus, and those 
weights do not change as the stimulus is varied. For contrast, with 
linear Fisher information, there is generally a different set of 
weights used for each (small) range of stimuli within which the 
discrimination is being performed. 
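
The two linear measures can be sketched numerically. The formulas below are the standard textbook forms for a scalar stimulus, $I_{F,\mathrm{lin}} = \nabla\mu^T (C^n)^{-1} \nabla\mu$ and $I_{\mathrm{OLE}} = L^T (C^\mu + C^n)^{-1} L$; that they coincide with Eqs. (5) and (12) of the Methods is an assumption here, since those equations are not reproduced in this section. All numerical values and function names are hypothetical.

```python
import numpy as np

def linear_fisher_information(dmu, C_noise):
    """I_F,lin = dmu^T C_noise^{-1} dmu for a scalar stimulus (assumed form)."""
    return float(dmu @ np.linalg.solve(C_noise, dmu))

def ole_information(L, C_mean, C_noise):
    """I_OLE = L^T (C_mean + C_noise)^{-1} L for a scalar stimulus (assumed form)."""
    return float(L @ np.linalg.solve(C_mean + C_noise, L))

# Hypothetical 3-neuron example with independent, unit-variance noise.
dmu = np.array([1.0, -0.5, 2.0])      # tuning-curve slopes at one stimulus value
C_noise = np.eye(3)
I_fisher = linear_fisher_information(dmu, C_noise)   # here equal to sum(dmu**2)

L = np.array([0.5, -0.2, 0.8])        # cov(s, x_i)
var_s = 1.0                           # stimulus variance
C_mean = np.outer(L, L) / var_s       # valid when mean responses are linear in s
I_ole = ole_information(L, C_mean, C_noise)
print(I_fisher, I_ole)
```

Note the difference in inputs: the Fisher measure uses local slopes at one stimulus value, while the OLE measure uses covariances taken over the whole stimulus distribution.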
Figure 1. Different structures of correlated trial-to-trial variability lead to different coding accuracies in a neural population.

(Modified and extended from [5].) We illustrate the underlying issues via a three neuron population, encoding two possible stimulus values (yellow 
and blue). Neurons' mean responses are different for each stimulus, representing their tuning. Trial-to-trial variability (noise) around these means is 
represented by the ellips(oid)s, which show 95% confidence levels. This noise has two aspects: for each individual neuron, its trial-to-trial variance; 
and at the population level, the noise correlations between pairs of neurons. We fix the former (as well as mean stimulus tuning), and ask how noise 
correlations impact stimulus coding. Different choices (A-D) of noise correlations affect the orientation and shape of response distributions in 
different ways, yielding different levels of overlap between the full (3D) distributions for each stimulus. The smaller the overlap, the more 
discriminable are the stimuli and the higher the coding performance. We also show the 2D projections of these distributions, to facilitate the 
comparison with the geometrical intuition of [5]. First, Row A is the reference case where neurons' noise is independent: zero noise correlations. Row 
B illustrates how noise correlations can increase overlap and worsen coding performance. Row C demonstrates the opposite case, where noise 
correlations are chosen consistently with the sign rule (SR) and information is enhanced compared to the independent noise case. Intriguingly, Row 
D demonstrates that there are more favorable possibilities for noise correlations: here, these violate the SR, yet improve coding performance vs. the 
independent case. Detailed parameter values are listed in Methods Section "Details for numerical examples and simulations". 
doi:10.1371/journal.pcbi.1003469.g001



Consequently, in the case of $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$, we will be
considering the average noise covariance matrix
$C^n_{ij} = E\{\mathrm{cov}(x_i, x_j \mid s)\}$, where the expectation is taken over the
stimulus distribution. Here we overload the notation $C^n$ to be the
covariance matrix that one chooses during the optimization, which 
will be either local (conditional covariances at a particular stimulus) 
or global depending on the information measure we consider. 
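
To make the averaged ("global") noise covariance concrete, here is a small sketch with two equally likely stimuli and invented conditional statistics. It averages the conditional covariances and checks the law of total covariance, i.e. that the total response covariance equals the averaged noise covariance plus the covariance of the mean responses. All numbers and variable names are hypothetical.

```python
import numpy as np

# Two equally likely stimuli with hypothetical conditional statistics.
p = np.array([0.5, 0.5])
mu = np.array([[1.0, 2.0],      # mean responses of 2 neurons given stimulus 0
               [3.0, 1.0]])     # mean responses given stimulus 1
C_cond = np.array([[[1.0, 0.3], [0.3, 1.0]],      # cov(x | stimulus 0)
                   [[2.0, -0.5], [-0.5, 1.5]]])   # cov(x | stimulus 1)

# Averaged noise covariance: expectation of the conditional covariances.
C_noise = np.einsum("s,sij->ij", p, C_cond)

# Covariance of the mean responses across stimuli.
mu_bar = p @ mu
C_mean = np.einsum("s,si,sj->ij", p, mu - mu_bar, mu - mu_bar)

# Total covariance computed directly from second moments, for comparison.
second_moment = np.einsum("s,sij->ij", p, C_cond + np.einsum("si,sj->sij", mu, mu))
Gamma = second_moment - np.outer(mu_bar, mu_bar)

print(np.allclose(Gamma, C_noise + C_mean))
```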

While $I_{\mathrm{OLE}}$ and $I_{F,\mathrm{lin}}$ are concerned with the performance of
linear decoders, the mutual information $I_{\mathrm{mut},G}$ between stimuli
and responses describes how well the optimal read-out could
recover the stimulus from the neural responses, without any
assumptions about the form of that decoder. However, we
emphasize that our results for $I_{\mathrm{mut},G}$ only apply to jointly Gaussian
stimulus and response distributions, which is a less general setting 



than the conditionally Gaussian cases studied in many places in 
the literature. An important exception is that Theorem 2 
additionally applies to the case of conditionally Gaussian 
distributions (see discussion in Section "Convexity of information 
measures"). 

For simplicity, we describe most results for a scalar stimulus s unless
stated otherwise, but the theory holds for multidimensional
stimuli (see Methods Section "Defining the information quantities, 
signal and noise correlations"). 

The sign rule revisited 

Arguments about pairs of neurons suggest that coding 
performance is improved - relative to the case of independent, 
or trial-shuffled data - when the noise correlations have the 
Table 1. Notations.

Symbol                        Meaning
$s$                           stimulus
$x_i$                         response of neuron i
$\mu_i$                       mean response of neuron i
$\nabla$                      derivative against s, Eq. (6)
$L$                           covariance between x and s, Eq. (11)
$C^n$                         noise covariance matrix (averaged or conditional, Section "Summary of the problem set-up")
$C^\mu$                       covariance of the mean response, Eq. (10)
$\succ 0$, $\succeq 0$        (a matrix is) positive definite and positive semidefinite
$\Gamma = C^n + C^\mu$        total covariance, Eq. (10)
                              optimal readout vector of OLE, Eq. (9)
$\rho_{ij}$                   noise correlations, Eq. (15)
$\rho^{\mathrm{sig}}_{ij}$    signal correlations, Eq. (16)
$I_{F,\mathrm{lin}}$          linear Fisher information, Eq. (5)
$I_{\mathrm{OLE}}$            OLE information (accuracy of OLE), Eq. (12)
$I_{\mathrm{mut},G}$          mutual information for Gaussian distributions, Eq. (13)

doi:10.1371/journal.pcbi.1003469.t001




opposite sign from the signal correlations [5,7,10,13]: we dub this 
the "sign rule" (SR). This notion has been explored and 
demonstrated in many places in the experimental and theoretical 
literature, and formally established for homogeneous positive
correlations [10]. However, its applicability in general cases is 
not yet known. 

Here, we formulate this SR property as a theorem without 
restrictions on homogeneity or population size. 

Theorem 1. If, for each pair of neurons, the signal and noise
correlations have opposite signs, the linear Fisher information is greater than in the
case of independent noise (trial-shuffled data). In the opposite situation where
the signs are the same, the linear Fisher information is decreased compared to
the independent case, in a regime of very weak correlations. Similar results hold
for $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$, with a modified definition of signal correlations given in
Section "Defining the information quantities, signal and noise
correlations".

In the case of Fisher information, the signal correlation between 
two neurons is defined as $\rho^{\mathrm{sig}}_{ij} = \frac{\nabla\mu_i \cdot \nabla\mu_j}{\|\nabla\mu_i\|_2 \, \|\nabla\mu_j\|_2}$ (Section "Defining the
information quantities, signal and noise correlations"). Here, the 
derivatives are taken with respect to the stimulus. This definition 
recalls the notion of the alignment in the change in the neurons' 
mean responses in, e.g., [11]. It is important to note that this 
definition for signal correlation is locally defined near a stimulus 
value; thus, it differs from some other notions of "signal 
correlation" in the literature, that quantify how similar the whole 
tuning curves are for two neurons (see discussion on the alternative 
$\rho^{\mathrm{sig}}_{ij}$ in Section "Defining the information quantities, signal and
noise correlations"). We choose to define signal correlations for
$I_{F,\mathrm{lin}}$, $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$ as described in Section "Defining the
information quantities, signal and noise correlations" to reflect 
precisely the mechanism behind the examples in [5], among 
others. 
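
A local signal correlation of this kind can be sketched as the cosine of the angle between the stimulus derivatives of two neurons' mean responses. This normalized inner-product form is an assumption made for illustration; the precise definition is the one in the Methods (Eq. (16)), which is not reproduced here. The function name and derivative values are hypothetical.

```python
import numpy as np

def signal_correlation(grad_mu):
    """Cosine similarity between rows of grad_mu (one row of stimulus derivatives per neuron)."""
    norms = np.linalg.norm(grad_mu, axis=1, keepdims=True)
    unit = grad_mu / norms
    return unit @ unit.T

# Hypothetical derivatives of 3 mean responses with respect to a 2-D stimulus.
grad_mu = np.array([[1.0, 0.0],
                    [0.0, 2.0],
                    [-1.0, -1.0]])
rho_sig = signal_correlation(grad_mu)
print(np.round(rho_sig, 3))
```

For a scalar stimulus each row has a single entry, so this quantity reduces to the sign of the product of the two tuning-curve slopes.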

It is a consequence of Theorem 1 that the SR holds pairwise;
different pairs of neurons may have different signs of noise
correlations, as long as they are consistent with their (pairwise)
signal correlations. The result holds as well for heterogeneous
populations. The essence of our proof of Theorem 1 is to calculate 
the gradient of the information function in the space of noise 
[Figure 2 plot. Panel title: Information as a function of correlations.]

Figure 2. The "sign rule" may fail to identify the globally
optimal correlations. The optimal linear estimator (OLE) information
$I_{\mathrm{OLE}}$ (Eq. (12)), which is maximized when the OLE produces
minimum-variance signal estimates, is shown as a function of all possible choices
of noise correlations (enclosed within the dashed line). These values are
$C^n_{1,2} = C^n_{1,3}$ (x-axis) and $C^n_{2,3}$ (y-axis) for a 3-neuron population. The bowl
shape exemplifies the general fact that $I_{\mathrm{OLE}}$ is a convex function and
thus must attain its maximum on the boundary (Theorem 2) of the
allowed region of noise correlations. The independent noise case and
global optimal noise correlations are labeled by a black dot and triangle
respectively. The arrow shows the gradient vector of $I_{\mathrm{OLE}}$, evaluated at
zero noise correlations. It points to the quadrant in which noise
correlations and signal correlations have opposite signs, as suggested
by Theorem 1. Note that this gradient vector, derived from the "sign
rule", does not point towards the global maximum, and actually misses
the entire quadrant containing that maximum. This plot is a
two-dimensional slice of the cases considered in Fig. 3, while restricting
$C^n_{1,2} = C^n_{1,3}$ (see Methods Section "Details for numerical examples and
simulations" for further parameters).
doi:10.1371/journal.pcbi.1003469.g002

correlations. We compute this gradient at the point representing 
the case where the noise is independent. The gradient itself is 
determined by the signal correlations, and will have a positive dot 
product with any direction of changing noise correlations that 
obeys the sign rule. Thus, information is increased by following the 
sign rule, and the gradient points to (locally) the direction for 
changing noise correlations that maximally improves the infor- 
mation, for a given strength of correlations. A detailed proof is 
included in Methods Section "Proof of Theorem 1: the generality
of the sign rule"; this includes a formula for the gradient direction
(Remark 1 in Section "Proof of Theorem 1: the generality of the
sign rule"). We have proven the same result for all three of our
coding metrics, and for both scalar and multi-dimensional
stimuli.
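
The gradient argument can be checked numerically in a small case. The sketch below assumes the standard form $I_{F,\mathrm{lin}} = \nabla\mu^T (C^n)^{-1} \nabla\mu$ for a scalar stimulus and unit noise variances; the slopes and correlation strength are invented. It compares independent noise with weak correlations that oppose, or match, the signs of the signal correlations, and writes out the gradient at the independent point, which for this form is $-2\,\nabla\mu_i \nabla\mu_j$ for each pair.

```python
import numpy as np

def fisher(dmu, C):
    """Linear Fisher information dmu^T C^{-1} dmu (scalar stimulus, assumed form)."""
    return float(dmu @ np.linalg.solve(C, dmu))

dmu = np.array([1.0, -0.5, 2.0])     # hypothetical tuning-curve slopes
n = dmu.size
off_diag = ~np.eye(n, dtype=bool)

# For a scalar stimulus the sign of the signal correlation is sign(dmu_i * dmu_j).
signal_sign = np.sign(np.outer(dmu, dmu))

def correlated(eps, sign_pattern):
    """Unit-variance covariance with off-diagonal entries eps * sign_pattern."""
    C = np.eye(n)
    C[off_diag] = eps * sign_pattern[off_diag]
    return C

eps = 0.05
I_indep = fisher(dmu, np.eye(n))
I_sr = fisher(dmu, correlated(eps, -signal_sign))    # noise opposes signal
I_anti = fisher(dmu, correlated(eps, signal_sign))   # noise matches signal

# Analytic gradient at independence: d I / d C_ij = -2 * dmu_i * dmu_j
# (i != j, unit variances, C_ij and C_ji varied together).
grad = -2.0 * np.outer(dmu, dmu)

print(I_indep, I_sr, I_anti)
```

In this example the sign-rule correlations raise the information above the independent value and the weak anti-sign-rule correlations lower it, as the theorem states for the weak-correlation regime.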

Intriguingly, there exists an asymmetry between the result on 
improving information (above), and the (converse) question of 
what noise correlations are worst for population coding. As we will 
show later, the information quantities are convex functions of the 
noise correlation coefficients (see Fig. 2). As a consequence, 
performance will keep increasing as one continues to move along a
"good" direction, for example indicated by the SR. This is what 
one expects when climbing a parabolic landscape in which the 
second derivative is always nonnegative. The same convexity result 
indicates that the performance will not decrease monotonically 
along a "bad" direction, such as the anti-SR direction. For 
example, if, while following the anti-SR direction, the system 
passed by the minimum of the information quantity, then 
continued increases in correlation magnitude would yield increases 
in the information. In fact, it is even possible for anti-SR 
correlations to yield better coding performance than would be 
achieved with independent noise. An example of this is shown in 
Fig. 2, where the arrow points in the direction in correlation space 
predicted by the SR, but performance that is better than with 
independent noise can also be obtained by choosing noise 
correlations in the opposite direction. 

Thus, the result that anti-SR noise correlations harm coding is 
only a "local" result — near the point of zero correlations - and 
therefore requires the assumption of weak correlations. We 
emphasize that this asymmetry of the SR is intrinsic to the 
problem, due to the underlying convexity. 
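
A two-neuron case makes this asymmetry explicit. Assuming unit noise variances and the standard form of the linear Fisher information, inverting the 2x2 covariance matrix gives $I(\rho) = (a^2 + b^2 - 2\rho a b)/(1 - \rho^2)$ for tuning slopes $a$ and $b$. The slopes below are hypothetical and chosen with the same sign, so positive noise correlations are anti-SR.

```python
def fisher_two_neurons(a, b, rho):
    """Linear Fisher information for two unit-variance neurons with slopes a, b.

    Obtained by inverting [[1, rho], [rho, 1]] in closed form.
    """
    return (a * a + b * b - 2.0 * rho * a * b) / (1.0 - rho * rho)

a, b = 1.0, 2.0   # hypothetical slopes; a * b > 0, so the signal correlation is positive
for rho in (-0.5, 0.0, 0.5, 0.9):
    print(rho, fisher_two_neurons(a, b, rho))
```

With these numbers the information is 5 for independent noise, falls to 4 at $\rho = 0.5$, and rises to about 7.4 at $\rho = 0.9$: weak anti-SR correlations hurt, but sufficiently strong ones outperform independence. The SR-consistent value $\rho = -0.5$ gives about 9.3.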

One obvious limitation of Theorem 1 and the "sign rule" results in 
general is that they only compare information in the presence of 
correlated noise with the baseline case of independent noise. This 
approach does not address the issue of finding the optimal noise 
correlations, nor does it provide much insight into experimental data 
that do not obey the SR. Does the sign rule describe optimal
configurations? What are the properties of the global optima? How 
should we interpret noise correlations that do not follow the SR? We 
will address these questions in the following sections. 

Optimal correlations lie on boundaries 

Let us begin by considering a simple example to see what could 
happen for the optimization problem we described in Section 
"Problem setup", when the baseline of comparison is no longer 
restricted to the case of independent noise. This example is for a 
population of 3 neurons. In order to better visualize the results, we 
further require that $C^n_{1,2} = C^n_{1,3}$. Therefore the space of correlation
configurations is two-dimensional. In Fig. 2, we plot the information
$I_{\mathrm{OLE}}$ as a function of the two free correlation coefficients (in this
example the variances are all $C^n_{ii} = 1$, thus $C^n_{ij} = \rho_{ij}$).

First, notice that there is a parabola-shaped region of all 
attainable correlations (in Fig. 2, enclosed by black dashed lines 
and the upper boundary of the square). The region is determined 
not only by the entry-wise constraint $|\rho_{ij}| \le 1$ (the square), but also
by a global constraint that the covariance matrix $C^n$ must be
positive semidefinite. For linear Fisher information and mutual
information for Gaussian distributions, we further assume $C^n \succ 0$
(i.e. $C^n$ is positive definite) so that $I_{F,\mathrm{lin}}$ and $I_{\mathrm{mut},G}$ remain finite
(see also Section "Defining the information quantities, signal and 
noise correlations"). As we will see again below, this important 
constraint leads to many complex properties of the optimization 
problem. This constraint can be understood by noting that 
correlations must be chosen to be "consistent" with each other and 
cannot be freely and independently chosen. For example, if
$\rho_{1,2} = \rho_{1,3}$ are large and positive, then cells 2 and 3 will be
positively correlated - since they both covary positively with cell 1
- and $\rho_{2,3}$ may thus not take negative values. In the extreme of
$\rho_{1,2} = \rho_{1,3} = 1$, $\rho_{2,3}$ is fully determined to be 1. Cases like this are
reflected in the corner shape in the upper right of the allowed region
in Fig. 2.
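
The shape of this allowed region can be reproduced with a short check. For unit variances with $\rho_{1,2} = \rho_{1,3} = x$ and $\rho_{2,3} = y$, the determinant of the correlation matrix factors as $(1 - y)(1 + y - 2x^2)$, so positive semidefiniteness amounts to $2x^2 - 1 \le y \le 1$. This closed form is a direct calculation included here as an illustration, not a formula quoted from the source; the test points and function names are arbitrary.

```python
import numpy as np

def is_psd(x, y, tol=1e-10):
    """Check positive semidefiniteness with rho_12 = rho_13 = x and rho_23 = y."""
    C = np.array([[1.0, x, x],
                  [x, 1.0, y],
                  [x, y, 1.0]])
    return bool(np.linalg.eigvalsh(C).min() >= -tol)

def in_parabolic_region(x, y):
    """Closed-form description derived in the lead-in: 2 x^2 - 1 <= y <= 1."""
    return 2.0 * x * x - 1.0 <= y <= 1.0

for x, y in [(0.9, 0.9), (0.9, 0.0), (0.9, -0.5), (0.0, -0.9)]:
    print((x, y), is_psd(x, y), in_parabolic_region(x, y))
```

The points with $x = 0.9$ and $y \le 0$ are rejected by both tests, matching the statement that two large positive correlations with cell 1 rule out a negative correlation between cells 2 and 3.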

The case of independent noise is denoted by a black dot in the 
middle of Fig. 2, and the gradient vector of $I_{\mathrm{OLE}}$ points to a



quadrant that is guaranteed to increase information vs. the 
independent case (Theorem 1). The direction of this gradient 
satisfies the sign rule, as also guaranteed by Theorem 1. However,
the gradient direction and the quadrant of the SR both fail to
capture the globally optimal correlations, which are at the upper right
corner of the allowed region, and indicated by the red triangle.
This is typically what happens for larger, and less symmetric 
populations, as we will demonstrate next. 

Since the sign rule cannot be relied upon to indicate the global 
optimum, what other tools do we have at hand? A key observation, 
which we prove in the Methods Section "Proof of Theorem 2: 
optima lie on boundaries", is that information is a convex function 
of the noise correlations (off-diagonal elements of C"). This 
immediately implies: 

Theorem 2. The optimal $C^n$ that maximizes information must lie on the
boundary of the region of correlations considered in the optimization. 
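
The convexity behind Theorem 2 can be probed numerically. The sketch below assumes the standard form $\nabla\mu^T (C^n)^{-1} \nabla\mu$ of the linear Fisher information for a scalar stimulus, and evaluates it along the straight segment between two arbitrarily chosen positive definite correlation matrices. Convexity implies that the values stay below the chord joining the endpoints, so the largest value on the segment is reached at an endpoint.

```python
import numpy as np

def fisher(dmu, C):
    """Linear Fisher information dmu^T C^{-1} dmu (scalar stimulus, assumed form)."""
    return float(dmu @ np.linalg.solve(C, dmu))

def corr_matrix(r12, r13, r23):
    """3x3 unit-variance covariance matrix with the given correlations."""
    return np.array([[1.0, r12, r13],
                     [r12, 1.0, r23],
                     [r13, r23, 1.0]])

dmu = np.array([1.0, -0.5, 2.0])          # hypothetical tuning-curve slopes
C0 = corr_matrix(0.3, 0.3, 0.3)           # two positive definite endpoints
C1 = corr_matrix(-0.3, 0.2, -0.4)

# Evaluate the information along the segment between C0 and C1.
ts = np.linspace(0.0, 1.0, 11)
values = np.array([fisher(dmu, (1 - t) * C0 + t * C1) for t in ts])

# Convexity: every value lies on or below the chord joining the endpoints.
chord = (1 - ts) * values[0] + ts * values[-1]
print(np.all(values <= chord + 1e-9))
```

The same reasoning applies to any segment inside the allowed region, which is why an interior point cannot be the maximizer unless the function is constant along every segment through it.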

As we saw in Fig. 2, mathematically feasible noise correlations 
may not be chosen arbitrarily but are constrained by the fact that 
the noise covariance matrix must be positive semidefinite. We denote
this condition by $C^n \succeq 0$, and recall that it is equivalent to all of its
eigenvalues being non-negative. According to our problem setup,
the diagonal elements of $C^n$, which are the individual neurons'
response variances, are fixed. It can be shown that this diagonal
constraint specifies a linear slice through the cone of all $C^n \succeq 0$,
resulting in a bounded convex region in $\mathbb{R}^{N(N-1)/2}$ called a
spectrahedron, for a population of N neurons. These spectrahedra 
are the largest possible regions of noise correlation matrices that 
are physically realizable, and are the set over which we optimize, 
unless stated otherwise. 

Importantly for biological applications, Theorem 2 will
continue to apply, when additional constraints define smaller 
allowed regions of noise correlations within the spectrahedron. 
These constraints may come from circuit or neuron-level factors. 
For example, in the case where correlations are driven by common 
inputs [22,23], one could imagine a restriction on the maximal 
value of any individual correlation value. In other settings, one 
might consider a global constraint by restricting the maximum 
Euclidean norm (2-norm) of the noise correlations (defined in Eq. 
(18) in Methods). 
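
A norm constraint of this kind interacts with the positive semidefiniteness requirement. As a sketch under stated assumptions (unit variances, a scalar stimulus, and a norm that counts each pair once, which may differ from Eq. (18)), the code below takes a direction $D$ in correlation space that obeys the sign rule and finds how far one can move along it from independence before the matrix $I + tD$ stops being positive semidefinite. Since the eigenvalues of $I + tD$ are $1 + t\lambda_i(D)$, the limit is $t_{\max} = -1/\lambda_{\min}(D)$. The slopes are hypothetical.

```python
import numpy as np

dmu = np.array([1.0, -0.5, 2.0])          # hypothetical tuning-curve slopes

# Sign-rule direction for unit variances: off-diagonal entries proportional to
# -dmu_i * dmu_j (noise correlations opposing the signal correlations).
D = -np.outer(dmu, dmu)
np.fill_diagonal(D, 0.0)

# C(t) = I + t * D stays positive semidefinite while 1 + t * lambda_min(D) >= 0.
lam_min = np.linalg.eigvalsh(D).min()     # negative, since trace(D) = 0 and D != 0
t_max = -1.0 / lam_min

def min_eig(t):
    """Smallest eigenvalue of the covariance matrix I + t * D."""
    return np.linalg.eigvalsh(np.eye(3) + t * D).min()

# Euclidean norm of the correlations at the boundary, counting each pair once.
iu = np.triu_indices(3, k=1)
norm_at_boundary = np.linalg.norm(t_max * D[iu])
print(t_max, norm_at_boundary)
```

For norm bounds larger than `norm_at_boundary`, this particular direction is no longer available, and any further gain has to come from a different correlation pattern.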

For a population of N neurons, there are $N(N-1)/2$ possible
correlations to consider; naturally, as N increases, the optimal 
structure of noise correlations can therefore become more 
complex. Thus we illustrate the Theorem above with an example 
of 3 neurons encoding a scalar stimulus, in which there are 3 noise 
correlations to vary. In Fig. 3, we demonstrate two different cases, 
each with a distinct matrix $C^\mu_{ij} = \mathrm{cov}(\mu_i, \mu_j)$ and vector
$L_i = \mathrm{cov}(s, x_i)$ (values are given in Methods Section "Details for numerical examples and simulations").
In the first case, there is a unique optimum (panel A, largest 
information is associated with the lightest color). In the second 
case, there are 4 disjoint optima (panel B), all of which lie on the 
boundary of the spectrahedron. 

In the next section, we will build from this example to a more 
complex one including more neurons. This will suggest further 
principles that govern the role of noise correlations in population 
coding. 

Heterogeneously tuned neural populations 

We next follow [8,9,15] and study a numerical example of a 
larger ($N = 20$) heterogeneously tuned neural population. The
stimulus encoded is the direction of motion, which is described by
a 2-D vector $s = (\cos\theta, \sin\theta)^T$. We used the same parameters
and functional form for the shape of tuning curves as in [8], the
details of which are provided in Methods Section "Details for 
Figure 3. Optimal coding is obtained on the boundary of the
allowed region of noise correlations. For fixed neuronal response
variances and tuning curves, we compute coding performance -
quantified by $I_{\mathrm{OLE}}$ information values - for different values of the
pairwise noise correlations. To be physically realizable, the correlation
coefficients must form a positive semi-definite matrix. This constraint
defines a spectrahedron, or a swelled tetrahedron, for the N = 3 cells
used. The colors of the points represent $I_{\mathrm{OLE}}$ information values. With
different parameters $C^\mu$ and $L$ (see values in Methods Section "Details
for numerical examples and simulations"), the optimal configuration
can appear at different locations, either unique (A) or attained at
multiple disjoint places (B), but always on the boundary of the
spectrahedron. In both panels, plot titles give the maximum value of
$I_{\mathrm{OLE}}$ attained over the allowed space of noise correlations, and the
value of $I_{\mathrm{OLE}}$ that would be obtained with the given tuning curves and
perfectly deterministic neural responses. This provides an upper bound
on the attainable $I_{\mathrm{OLE}}$ (see text Section "Noise cancellation").



Interestingly, in panel (A), the noisy population achieves this upper 
bound on performance, but this is not the case in (B). Details of 
parameters used are in Methods Section "Details for numerical 
examples and simulations". 
doi:10.1371/journal.pcbi.1003469.g003 



numerical examples and simulations". The tuning curve for each
neuron was allowed to have randomly chosen width and 
magnitude, and the trial-to-trial variability was assumed to be 
Poisson: the variance is equal to the mean. As shown in Fig. 4 A, 
under our choice of parameters the neural tuning curves - and by 
extension, their responses to the stimuli - are highly heterogeneous.
Once again, we quantify coding by $I_{\mathrm{OLE}}$ (see definition in Section
"Problem setup" or Eq. (12) in Methods). 

Our goal with this example is to illustrate two distinct regimes, 
with different properties of the noise correlations that lead to 
optimal coding. In the first regime, which occurs closest to the case 
of independent noise, the SR determines the optimal correlation 
structure. In the second, moving further away from the 
independent case, the optimal correlations may disobey the SR. 
(A related effect was found by [8]; we return to this in the 
Discussion.) We accomplish this in a very direct way: for gradually
increasing values of the (additional) constraint on the Euclidean norm of
correlations (Eq. (18) in Methods Section "Defining the informa- 
tion quantities, signal and noise correlations"), we numerically 
search for optimal noise correlation matrices and compare them to 
predictions from the SR. 

In Fig. 4 B we show the results, comparing the information 
attained with noise correlations that obey the sign rule with those 
that are optimized, for a variety of different noise correlation 
strengths. As they must be, the optimized correlations always 
produce information values as high as, or higher than, the values 
obtained with the sign rule. 

In the limit where the correlations are constrained to be small, 
the optimized correlations agree with the sign rule; an example of 
these "local" optimized correlations is shown in Fig. 5 ADG, 
corresponding to the point labeled (i) in Fig. 4 BC. This is 
predicted by Theorem 1. In this "local" region of near-zero noise 
correlations, we see a linear alignment of signal and noise 
correlations (Fig. 5 D). As larger correlation strengths are reached 
(points (ii) and (iii) in Fig. 4 BC), we observe a gradual violation of 
the sign rule for the optimized noise correlations. This is shown by 
the gradual loss of the linear relationship between signal and noise 
correlations in Fig. 5 D vs. E vs. F, as quantified by the $R^2$ statistic. 
Interestingly, this can happen even when the correlation 
coefficients continue to have reasonably small values, and are 
broadly consistent with the ranges of noise correlations seen in 
physiology experiments [3,8,24]. 

The two different regimes of optimized noise correlations arise 
because, at a certain correlation strength, the correlation strength 
can no longer be increased along the direction that defines the sign 
rule without leaving the region of positive semidefinite covariance 
matrices. However, correlation matrices still exist that allow for 
more informative coding with larger correlation strengths. This 
reflects the geometrical shape of the spectrahedron, wherein the 
optima may lie in the "corners", as shown in Fig. 3. For these 
larger-magnitude correlations, the sign rule no longer describes 
optimized correlations, as shown with an example of optimized 
correlations in Fig. 5 CF. 
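
The existence of a critical length can be made concrete with a short numerical sketch. It works in correlation coordinates (unit variances), so that moving a distance $t$ from the independent case along a symmetric, zero-diagonal direction $D$ gives the matrix $I + tD$. The function name `critical_length` and the random direction are illustrative assumptions; this is not the sign-rule direction of the example above. Because $D$ has zero trace, its smallest eigenvalue is negative, and positive semidefiniteness is lost once $t$ exceeds $-1/\lambda_{\min}(D)$.

```python
import numpy as np

def critical_length(D):
    """Largest t >= 0 for which I + t * D is still positive semidefinite.

    D must be a nonzero symmetric matrix with zero diagonal, so that
    I + t * D keeps unit diagonal (a candidate correlation matrix).
    """
    lam_min = np.linalg.eigvalsh(D).min()   # negative, since trace(D) = 0
    return -1.0 / lam_min

rng = np.random.default_rng(0)
N = 5
M = rng.normal(size=(N, N))
D = M + M.T
np.fill_diagonal(D, 0.0)                    # hypothetical direction of change

t_max = critical_length(D)
# Smallest eigenvalue just inside and just outside the allowed region.
eig_inside = np.linalg.eigvalsh(np.eye(N) + 0.99 * t_max * D).min()
eig_outside = np.linalg.eigvalsh(np.eye(N) + 1.01 * t_max * D).min()
```

Beyond `t_max` the matrix is no longer a valid correlation matrix, so any further increase in correlation strength has to proceed along a different direction.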

Fig. 5 illustrates another interesting feature. There is a diverse 
set of correlation matrices, with different Euclidean norms beyond 
the value of (roughly) 1.2, that all achieve the same globally 
optimal information level. As we see in the next section, this 
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Heterogeneously tuned population 



[Figure 4 image; x-axis label of the plots: strength of noise correlations]

Figure 4. Heterogeneous neural population and violations of the sign rule with increasing correlation strength. We consider signal 
encoding in a population of 20 neurons, each of which has a different dependence of its mean response on the stimulus (heterogeneous tuning 
curves shown in A). We optimize the coding performance of this population with respect to the noise correlations, under several different constraints 
on the magnitude of the allowed noise correlations. Panel (B) shows the resultant - optimal given the constraint - values of OLE information $I_{\mathrm{OLE}}$, 
with different noise correlation strengths (blue circles). The strength of correlations is quantified by the Euclidean norm (Eq. (18)). For comparison, the 
red crosses show information obtained for correlations that obey the sign rule (in particular, pointing along the gradient giving greatest information 
for weak correlations); this information is always less than or equal to the optimum, as it must be. Note that correlations that follow the sign rule fail 
to exist for large correlation strengths, as the defining vector points outside of the allowed region (spectrahedron) beyond a critical length (labeled 
(ii)). For correlation strengths beyond this point, distinct optimized noise correlations continue to exist; the information values they obtain eventually 
saturate at noise-free levels (see text), which is 1 for the example shown here. This occurs for a wide range of correlation strengths. Panel (C) shows 
how well these optimized noise correlations are predicted from the corresponding signal correlations (by the sign rule), as quantified by the $R^2$ 
statistic (between 0 and 1, see Fig. 5). For small magnitudes of correlations, the $R^2$ values are high, but these decline when the noise correlations are 
larger. 

doi:10.1371/journal.pcbi.1003469.g004 



phenomenon is actually typical for large populations, and can be 
described precisely. 

Noise cancellation 

For certain choices of tuning curves and noise variances, 
including the examples in Fig. 3 A and Section "Heterogeneously 
tuned neural populations", we can tell precisely the value of the 
globally optimized information quantities, that is, the information 
levels obtained with optimal noise correlations. For the OLE, 
this global optimum is the upper bound on $I_{\mathrm{OLE}}$. This is shown 
formally in Lemma 8, but it simply translates to an intuitive lower 
bound on the OLE error, similar to the data processing inequality 
for mutual information. This bound states that the OLE error 
cannot be smaller than the OLE error when there is no noise in 
the responses, i.e. when the neurons produce a deterministic 



response conditioned on the stimulus. This upper bound may — 
and often will (Theorem 5) — be achievable by populations of 
noisy neurons. 

Let us first consider an extremely simple example. Consider the 
case of two neurons with identical tuning curves, so that their 
responses are $x_i = \mu(s) + n_i$, where $n_i$ is the noise in the response of 
neuron $i \in \{1,2\}$, and $\mu(s)$ is the same mean response under 
stimulus $s$. In this case, the "noise free" coding is when $n_1 = n_2 = 0$ 
on all trials, and the inference accuracy is determined by the shape 
of the tuning curve $\mu(s)$ (whether or not it is invertible, for 
example). Now let us consider the case where the noise in the 
neurons' responses is non-zero but perfectly anti-correlated, so that 
$n_1 = -n_2$ on all trials. We can then choose the read-out as 
$(x_1 + x_2)/2 = \mu(s)$ to cancel the noise and achieve the same coding 
accuracy as the "noise free" case. 
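
The two-neuron example can be checked with a small simulation. The following is a minimal sketch under stated assumptions: the tuning curve $\mu(s) = s$, the uniform stimulus distribution, the Gaussian noise and the function name `averaged_readout_mse` are all illustrative choices, not taken from the text. It measures how far the averaged read-out $(x_1 + x_2)/2$ falls from $\mu(s)$ for a given noise correlation.

```python
import numpy as np

def averaged_readout_mse(rho, sigma=1.0, n_trials=200_000, seed=0):
    """MSE of the read-out (x1 + x2) / 2 as an estimate of mu(s), for two
    neurons with identical tuning and noise correlation rho."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-1.0, 1.0, size=n_trials)   # stimulus on each trial
    mu = s                                      # illustrative tuning curve mu(s) = s
    z1, z2 = rng.standard_normal((2, n_trials))
    n1 = sigma * z1
    # n2 has variance sigma^2 and correlation rho with n1.
    n2 = sigma * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)
    x1, x2 = mu + n1, mu + n2                   # x_i = mu(s) + n_i
    return float(np.mean(((x1 + x2) / 2.0 - mu) ** 2))

mse_independent = averaged_readout_mse(rho=0.0)      # close to sigma^2 / 2
mse_anticorrelated = averaged_readout_mse(rho=-1.0)  # noise cancels in the average
```

In this setting the read-out error variance is $\sigma^2 (1 + \rho)/2$, so it shrinks continuously as $\rho$ decreases and vanishes only at $\rho = -1$.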
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Noise vs. signal correlations at optima 
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Figure 5. In our larger neural population, the sign rule governs 
optimal noise correlations only when these correlations are 
forced to be very small in magnitude; for stronger correlations, 
optimized noise correlations have a diverse structure. Here we 
investigate the structure of the optimized noise correlations obtained in 
Fig. 4; we do this for three examples with increasing correlation 
strength, indicated by the labels (i), (ii), (iii) in that figure. First (ABC) 
show scatter plots of the noise correlations of the neural pairs, as a 
function of their signal correlations (defined in Methods Section 
"Defining the information quantities, signal and noise correlations"). 
For each example, we also show (DEF) a version of the scatter plot 
where the signal correlations have been rescaled in a manner discussed 
in Section "Parameters for Fig. 1, Fig. 2 and Fig. 3", that highlights the 
linear relationship (wherever it exists) between signal and noise 
correlations. In both sets of panels, we see the same key effect: the 
sign rule is violated as the (Euclidean) strength of noise correlations 
increases. In (ABC), this is seen by noting the quadrants where the dots 
are located: the sign rule predicts they should only be in the second 
and fourth quadrants. In (DEF), we quantify agreement with the sign 
rule by the $R^2$ statistic. Finally, (GHI) display histograms of the noise 
correlations; these are concentrated around 0, with low average values 
in each case. 

doi:10.1371/journal.pcbi.1003469.g005 



The preceding example shows that, at least in some cases, one 
can choose noise correlations in such a way that a linear decoder 
achieves "noise-free" performance. One is naturally left to wonder 
whether this observation applies more generally. 

First, we state the conditions on the noise covariance matrices 
under which the noise-free coding performance is obtained. We 
will then identify the conditions on parameters of the problem, i.e. 
the tuning curves (or receptive fields) and noise variances, under 
which this condition can be satisfied. Recall that the OLE is based 
on a fixed (across stimuli) linear readout coefficient vector A 
defined in Eq. (9). 

Theorem 3. A covariance matrix $C^n$ attains the noise-free bound for 
OLE information (and hence is optimal), if and only if $C^n A = C^n (C^\mu)^{-1} L = 0$. 
Here $L$ is the cross-covariance between the stimuli 
and responses (Eq. (11)), $C^\mu$ is the covariance of the mean response (Eq. (10)), 
and $A$ is the linear readout vector for OLE, which is the same as in the 
noise-free case, that is, $A = (C^n + C^\mu)^{-1} L = (C^\mu)^{-1} L$, when the condition 
is satisfied. 
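
The "if" direction of this statement can be traced in a few lines. The following sketch assumes that $C^\mu$ and $C^\mu + C^n$ are invertible and that $C^n$ is positive semidefinite; the symbol $A_0 = (C^\mu)^{-1} L$ for the noise-free readout is introduced here only for illustration.

```latex
\begin{align*}
C^n A_0 = 0
  \;&\Longrightarrow\; (C^\mu + C^n) A_0 = C^\mu A_0 = L \\
  \;&\Longrightarrow\; A_0 = (C^\mu + C^n)^{-1} L , \\
A_0^T C^n A_0 = 0
  \;&\Longleftrightarrow\; C^n A_0 = 0
  \qquad (\text{since } C^n \succeq 0).
\end{align*}
```

Under these assumptions the noise-free readout remains the optimal linear readout when the noise is present, and its trial-to-trial variance vanishes.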

We note that when the condition is satisfied, the conditional 
variance of the OLE is $A^T C^n A = 0$. This indicates that all the 
error comes from the bias, if we as usual write the mean square 
error (for scalar $s$) in two parts, $E\{(\hat{s} - s)^2\} = E\{\mathrm{var}(\hat{s}|s)\} + \mathrm{var}(E\{\hat{s}|s\} - s)$. 
The condition obtained here can also be 
interpreted as "signal/readout being orthogonal to the noise." 
While this perspective gives useful intuition about the result, we 
find that other ideas are more useful for constructing proofs of this 
and other results. We discuss this issue more thoroughly in Section 
"The geometry of the covariance matrix". 

In general, this condition may not be satisfied by some choices 
of pairwise correlations. The above theorem implies that, given the 
tuning curves, the issue of whether or not such "noise free" coding 
is achievable will be determined only by the relative magnitude, or 
heterogeneity, of the noise variances for each neuron - the 
diagonal entries of $C^n$. The following theorem outlines precisely 
the conditions under which such "noise-free" coding performance 
is possible, a condition that can be easily checked for given 
parameters of a model system, or for experimental data. 

Theorem 4. For scalar stimulus, let $q_i = \sqrt{A_i^2 C^n_{ii}}$, $i = 1, \cdots, N$, 
where $A = (C^\mu)^{-1} L$ is the readout vector for OLE in the noise-free case. 
Noise correlations may be chosen so that coding performance matches that which 
could be achieved in the absence of noise if and only if 

$$\max_i \{ q_i \} \le \frac{1}{2} \sum_{i=1}^{N} q_i. \qquad (1)$$

When "<" is satisfied, all optimal correlations attaining the maximum form 
a $\frac{N(N-3)}{2}$ dimensional convex set on the boundary of the spectrahedron. 
When "=" is attained, the dimension of that set is $\frac{N_0(N_0+1)}{2}$, where $N_0$ is 
the number of zeros in $\{q_i\}$. 
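
The condition of Eq. (1) is straightforward to check numerically. The sketch below is illustrative only: the function names and the numerical values of the readout and noise variances are hypothetical. It reads $q_i$ as $|A_i| \sqrt{C^n_{ii}}$ and, for the special case $N = 3$ with all $q_i$ nonzero, also builds the noise-canceling correlation matrix in closed form. That closed form follows from requiring $\sum_i v_i u_i = 0$ for unit vectors $u_i$ with $v_i = A_i \sqrt{C^n_{ii}}$, which fixes each pairwise cosine $u_i \cdot u_j$.

```python
import numpy as np

def noise_cancellation_possible(A, noise_var):
    """Check Eq. (1): max_i q_i <= (1/2) * sum_i q_i,
    with q_i = |A_i| * sqrt(noise_var_i)."""
    q = np.abs(A) * np.sqrt(noise_var)
    return bool(q.max() <= 0.5 * q.sum())

def cancelling_correlations_3(A, noise_var):
    """For N = 3, return the correlation matrix rho with rho @ v = 0,
    where v_i = A_i * sqrt(noise_var_i). Assumes all v_i are nonzero
    and that Eq. (1) holds."""
    v = A * np.sqrt(noise_var)
    rho = np.eye(3)
    for i, j, k in [(0, 1, 2), (0, 2, 1), (1, 2, 0)]:
        # From |v_i u_i + v_j u_j|^2 = v_k^2 for unit vectors u_i, u_j.
        rho[i, j] = rho[j, i] = (v[k]**2 - v[i]**2 - v[j]**2) / (2 * v[i] * v[j])
    return rho

A = np.array([1.0, -0.5, 0.8])           # hypothetical noise-free readout
noise_var = np.array([1.0, 4.0, 2.0])    # hypothetical noise variances (diagonal of C^n)

ok = noise_cancellation_possible(A, noise_var)
rho = cancelling_correlations_3(A, noise_var)
sd = np.sqrt(noise_var)
C_n = rho * np.outer(sd, sd)             # noise covariance with the prescribed diagonal
residual = np.linalg.norm(C_n @ A)       # near 0: the condition C^n A = 0 of Theorem 3
min_eig = np.linalg.eigvalsh(rho).min()  # near 0: rho sits on the spectrahedron boundary
```

For $N = 3$ the dimension formula $N(N-3)/2$ gives zero, which is consistent with this construction returning a single matrix; for larger $N$ a whole family of solutions is expected.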

We pause to make three observations about this Theorem. First, 
the set of optimal correlations, when it occurs, is high-dimensional. 
This bears out the notion that there are many different, highly 
diverse noise correlation structures that all give the same (optimal) 
level of the information metrics. Second, and more technically, we 
note that the (convex) set of optimal correlations is flat (contained 
in a hyperplane of its dimension), as viewed in the higher 
dimensional space $\mathbb{R}^{N(N-1)/2}$. A third intriguing implication of the 
theorem is that when noise-cancellation is possible, all optimal 
correlations are connected, as the set is convex (any two points are 
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connected by a linear segment that also in the set), and thus the 
case of disjoint optima as in Fig. 3 B will never happen when 
optimal coding achieves noise-free levels. Indeed, in Fig. 3 B, the 
noise-free bound is not attained. 

The high dimension of the convex set of noise-canceling 
correlations explains the diversity of optimal correlations seen in 
Fig. 4 B (i.e., with different Euclidean norms). Such a property is 
nontrivial from a geometric point of view. One may conclude 
prematurely that the dimension result is obvious if one considers 
algebraically the number of free variables and constraints in the 
condition of Theorem 3. This argument would give the dimension 
of the resulting linear space. However, as shown in the proof, there 
is another nontrivial step to show that the linear space has some 
finite part that also satisfies the positive semidefinite constraint. 
Otherwise, many dimensions may shrink to zero in size, as 
happens at the corner of the spectrahedron, resulting in a small 
dimension. 

The optimization problem can be thought of as finding the level 
set of the information function with the largest possible 
value that still intersects the spectrahedron. The level sets 
are collections of all points where the information takes the same 
value. These form high dimensional surfaces, and contain each 
other, much as layers of an onion. Here these surfaces are also 
guaranteed to be convex as the information function itself is. Next, 
note from Fig. 3 that we have already seen that the spectrahedron 
has sharp corners. Combining this with our view of the level sets, 
one might guess that the set of optimal solutions — i.e. the 
intersection — should be very low dimensional. Such intuition is 
often used in mathematics and computer science, e.g. with regards 
to the sparsity promoting tendency of $L_1$ optimization. The high 
dimensionality shown by our theorem therefore reflects a 
nontrivial relationship between the shape of the spectrahedron 
and the level sets of the information quantities. 

Although our theorem only characterizes the abundance of the 
set of exact optimal noise correlations, it is not hard to imagine the 
same, if not more, abundance should also hold for correlations that 
approximately achieve the maximal information level. This is 
indeed what we see in numerical examples. For example, note the 
long, curved level-set curves in Fig. 2 near the boundaries of the 
allowed region. Along these lines lie many different noise 
correlation matrices that all achieve the same nearly-optimal 
values of $I_{\mathrm{OLE}}$. The same is true of the many dots in Fig. 3 A that 
all share a similar "bright" color corresponding to large $I_{\mathrm{OLE}}$. 

One may worry that the noise cancellation discussed above is 
rarely achievable, and thus somewhat spurious. The following 
theorem suggests that the opposite is true. In particular, it gives 
one simple condition under which the noise cancellation 
phenomenon, and resultant high-dimensional sets of optimal noise 
correlation matrices, will almost surely be possible in large neural 
populations. 

Theorem 5. If the $\{q_i\}$ defined in Theorem 4 are independent and 
identically distributed (i.i.d.) as a random variable $X$ on $[0,\infty)$ with 
$0 < E\{X\} < \infty$, then the probability 

$$P(\text{the inequality of Eq. (1) is satisfied}) \to 1, \quad \text{as } N \to \infty. \qquad (2)$$

In actual populations, the $q_i$ might not be well described as i.i.d. 
However, we believe that the conditions of the inequality of Eq. (1) 
are still likely to be satisfied, as the contrary seems to require one 
neuron with highly outlying tuning and noise variance value (a few 
comparable outliers won't necessarily violate the condition, as their 
magnitudes will enter on the right hand side of the condition, thus 
the condition only breaks with a single "outlier of outliers"). 
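
The trend stated in Theorem 5 can be illustrated by Monte Carlo. The sketch below is only an illustration: the exponential distribution for the $q_i$, the population sizes and the function name `prob_condition_holds` are assumptions made here, not choices from the text.

```python
import numpy as np

def prob_condition_holds(N, n_draws=20_000, seed=0):
    """Monte Carlo estimate of P(max_i q_i <= (1/2) * sum_i q_i) for
    q_i drawn i.i.d. from an exponential distribution (illustrative choice)."""
    rng = np.random.default_rng(seed)
    q = rng.exponential(scale=1.0, size=(n_draws, N))
    holds = q.max(axis=1) <= 0.5 * q.sum(axis=1)
    return float(holds.mean())

# Estimated probability that Eq. (1) holds, for increasing population size.
estimates = {N: prob_condition_holds(N) for N in (3, 5, 10, 20)}
```

For this particular distribution the probability can also be computed exactly as $1 - N/2^{N-1}$, which approaches 1 quickly as $N$ grows; the simulated values can be compared against it.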

Discussion 

Summary 

In this paper, we considered a general mathematical setup in 
which we investigated how coding performance changes as noise 
correlations are varied. Our setup made no assumptions about the 
shapes (or heterogeneity) of the neural tuning curves (or receptive 
fields), or the variances in the neural responses. Thus, our results - 
which we summarize below - provide general insights into the 
problem of population coding. These are as follows: 

• We proved that the sign rule — if signal and noise correlations 
have opposite signs, then the presence of noise correlations will 
improve encoded information vs. the independent case — 
holds for any neural population. In particular, we showed that 
this holds for three different metrics of encoded information, 
and for arbitrary tuning curves and levels of heterogeneity. 
Furthermore, we showed that, in the limit of weak correlations, 
the sign rule predicts the optimal structure of noise correlations 
for improving encoded information. 

• However, as also found in the literature (see below), the sign 
rule is not a necessary condition for good coding performance 
to be obtained. We observed that there will typically be a 
diverse family of correlation matrices that yield good coding 
performance, and these will often violate the sign rule. 

• There is significantly more structure to the relationship 
between noise correlations and encoded information than that 
given by the sign rule alone. The information metrics we 
considered are all convex functions with respect to the entries in 
the noise correlation matrix. Thus, we proved that the optimal 
correlation structures must lie on boundaries of any allowed 
set. These boundaries could come from mathematical 
constraints - all covariance matrices must be positive 
semidefinite - or mechanistic/biophysical ones. 

• Moreover, boundaries containing optimal noise correlations 
have several important properties. First, they typically contain 
correlation matrices that lead to the same high coding fidelity 
that one could obtain in the absence of noise. Second, when 
this occurs there is a high-dimensional set of different 
correlation matrices that all yield the same high coding fidelity 
— and many of these matrices strongly violate the sign rule. 

• Finally, for reasonably large neural populations, we showed 
that both the noise-free, and more general SR-violating 
optimal, correlation structures emerge while the average noise 
correlations remain quite low — with values comparable to 
some reports in the experimental literature. 

Convexity of information measures 

Convexity of information with respect to noise correlations 
arises conceptually throughout the paper, and specifically in 
Theorem 2. We have shown that such convexity holds for all three 
particular measures of information studied above ($I_{F,\mathrm{lin}}$, $I_{\mathrm{OLE}}$, and 
$I_{\mathrm{mut},G}$). Here, we show that these observations may reflect a 
property intrinsic to the concept of information, so that our results 
could apply more generally. 

It is well known that mutual information is convex with respect 
to conditional distributions. Specifically, consider two random 
variables (or vectors) $x$, $y$, with conditional distributions 
$x|s \sim p_1(x|s)$ and $y|s \sim p_2(y|s)$ (with respect to the random "stimulus" 
variable(s) $s$). Suppose another variable $z$ has a conditional 
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distribution given by a nonnegative linear combination of the two, 
$z|s \sim p(z|s) = \alpha p_1(z|s) + (1 - \alpha) p_2(z|s)$, $\alpha \in [0,1]$. The mutual information 
must satisfy $I(z,s) \le \alpha I(x,s) + (1 - \alpha) I(y,s)$. Notably, this 
fact can be proved using only the axiomatic properties of mutual 
information (the chain rule for conditional information and 
nonnegativity) [25]. 

It is easy to see how this convexity in conditional distributions is 
related to the convexity in noise correlations we use. To do this, we 
further assume that the two conditional means are the same, 
$E\{x|s\} = E\{y|s\}$, and let $x$, $y$ be random vectors. Introduce an 
auxiliary Bernoulli random variable $T$ that is independent of $s$, with 
probability $\alpha$ of being 1. The variable $z$ can then be explicitly 
constructed using $T$: for any $s$, draw $z$ according to $p_1(z|s)$ if $T = 1$ 
and according to $p_2(z|s)$ otherwise. Using the law of total covariance, 
the covariance (conditioned on $s$) between the $i,j$ elements of $z$ is 

$$\begin{aligned}
\mathrm{cov}_s(z_i, z_j) &= E\{\mathrm{cov}_s(z_i, z_j | T)\} + \mathrm{cov}_s(E_s\{z_i | T\}, E_s\{z_j | T\}) \\
&= \alpha\, \mathrm{cov}_s(z_i, z_j | T = 1) + (1 - \alpha)\, \mathrm{cov}_s(z_i, z_j | T = 0) + 0 \\
&= \alpha\, \mathrm{cov}_s(x_i, x_j) + (1 - \alpha)\, \mathrm{cov}_s(y_i, y_j).
\end{aligned}$$

This shows that the noise covariances are expressed accordingly as 
linear combinations. If the information depends only on covariances 
(besides the fixed means), as for the three measures we consider, the 
two notions of convexity become equivalent. A direct corollary of this 
argument is that the convexity result of Theorem 2 also holds in the 
case of mutual information for conditionally Gaussian distributions 
(i.e., such that X given s is Gaussian distributed). 
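
The mixture construction can be verified by sampling. In the sketch below, for one fixed stimulus, the shared mean, the two noise covariances `C_x` and `C_y`, the mixing weight and the Gaussian form of the two conditional distributions are all hypothetical values chosen for illustration. The empirical covariance of the mixture variable is compared with the convex combination of the two covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.3, 400_000
mean = np.array([1.0, -2.0])                 # shared conditional mean E{x|s} = E{y|s}
C_x = np.array([[1.0, 0.6], [0.6, 1.0]])     # noise covariance of x given s
C_y = np.array([[2.0, -0.5], [-0.5, 0.5]])   # noise covariance of y given s

x = rng.multivariate_normal(mean, C_x, size=n)
y = rng.multivariate_normal(mean, C_y, size=n)
T = rng.random(n) < alpha                    # Bernoulli(alpha) selector, drawn independently
z = np.where(T[:, None], x, y)               # z ~ alpha * p1 + (1 - alpha) * p2

C_z = np.cov(z, rowvar=False)                # empirical covariance of the mixture
C_mix = alpha * C_x + (1 - alpha) * C_y      # prediction from the law of total covariance
max_abs_diff = float(np.abs(C_z - C_mix).max())
```

The agreement relies on the two conditional means being equal; with different means the term $\mathrm{cov}_s(E_s\{z_i|T\}, E_s\{z_j|T\})$ would no longer vanish.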

Sensitivity and robustness of the impact of correlations 
on encoded information 

One obvious concern about our results, especially those related 
to the "noise-free" coding performance, is that this performance 
may not be robust to small perturbations in the covariance matrix 
- and thus, for example, real biological systems might be unable to 
exploit noise correlations in signal coding. This issue was recently 
highlighted, in particular, by [26]. 

At first, concerns about robustness might appear to be alleviated 
by our observation that there is typically a large set of possible 
correlation structures that all yield similar (optimal) coding 
performance (Theorem 4). However, if the correlation matrix 
was perturbed along a direction orthogonal to the level set of the 
information quantity at hand, this could still lead to arbitrary 
changes in information. To address this matter directly, we 
explicitly calculated the following upper bound for the sensitivity 
of information, or condition number $\kappa$, with respect to (sufficiently small) 
perturbations. The condition number $\kappa$ is defined as the ratio of 
relative change in the function to that in its variables. For example, 
the condition number corresponding to perturbing $C^n$ is the 
smallest number $\kappa_{F,\mathrm{lin};C^n}$ that satisfies 
$\frac{|\Delta I_{F,\mathrm{lin}}|}{|I_{F,\mathrm{lin}}|} \le \kappa_{F,\mathrm{lin};C^n} \frac{\|\Delta C^n\|}{\|C^n\|}$. 
Similarly one can define the condition number $\kappa_{F,\mathrm{lin};\nabla\mu}$ for perturbing 
the tuning of neurons $\nabla\mu$. 

Proposition 6. The local condition number of $I_{F,\mathrm{lin}}$ under 
perturbations of $C^n$ (where magnitude is quantified by 2-norm) is bounded by 

$$\kappa_{F,\mathrm{lin};C^n} \le 2\kappa_2(C^n) := 2 \|(C^n)^{-1}\|_2 \cdot \|C^n\|_2 = \frac{2\lambda_{\max}}{\lambda_{\min}}, \qquad (3)$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalue of $C^n$ 
respectively. Here $\kappa_2$ is the condition number with respect to the 2-norm, as 
defined in the above equation. 

Similarly, the condition number for perturbing of $\nabla\mu$ is bounded by 

$$\kappa_{F,\mathrm{lin};\nabla\mu} \le \frac{\max_i \|(\nabla\mu)_{\cdot,i}\|_2}{\min_i \|(\nabla\mu)_{\cdot,i}\|_2}, \qquad (4)$$

where $(\nabla\mu)_{\cdot,i}$ is the $i$-th column of $\nabla\mu$ and assume $(\nabla\mu)_{\cdot,i} \neq 0$ for all $i$. Here 
$K$ is the dimension of the stimulus $s$. 

Though stated for $I_{F,\mathrm{lin}}$, the same results also hold for $I_{\mathrm{OLE}}$ when 
replacing $C^n$ by $C^\mu + C^n$ in Eq. (3) and (4). We believe that a 
similar property is possible to derive for mutual information $I_{\mathrm{mut},G}$, 
but that the expression could be quite cumbersome; we do not 
pursue this further here. 

To interpret this Proposition, we make the following observations, 
which explain when the sensitivity or condition numbers will 
(or will not) be themselves reasonable in size, for given noise 
correlations $C^n$. In our setup, the diagonal of $C^n$ (or $C^\mu + C^n$ for 
OLE) is fixed, and therefore $\lambda_{\max}$ is bounded (Gershgorin circle 
theorem). As long as $C^n$ (or $C^\mu + C^n$) is not close to singular, the 
information should therefore be robust, i.e. with a reasonably 
bounded condition number. For OLE, as $C^\mu + C^n \succeq C^\mu$, we always 
have a universal bound of $\kappa$ determined only by $C^\mu$. For the linear 
Fisher information, however, nearly singular $C^n$ can more 
typically occur near optimal solutions; in these cases, the condition 
numbers will be very large. 
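
The sensitivity to perturbations of the noise covariance can be probed numerically. The sketch below assumes the standard scalar-stimulus form of linear Fisher information, $\nabla\mu^T (C^n)^{-1} \nabla\mu$; the tuning slopes, the covariance matrix and the perturbation size are hypothetical. It applies small random symmetric perturbations and records the largest observed ratio of relative change in information to relative change in $C^n$. Random perturbations only sample the sensitivity and do not prove a bound.

```python
import numpy as np

def linear_fisher(grad_mu, C_n):
    """Linear Fisher information for a scalar stimulus:
    grad_mu^T (C^n)^{-1} grad_mu."""
    return float(grad_mu @ np.linalg.solve(C_n, grad_mu))

rng = np.random.default_rng(0)
N = 10
grad_mu = rng.normal(size=N)                 # hypothetical tuning-curve slopes
B = rng.normal(size=(N, N))
C_n = B @ B.T / N + 0.5 * np.eye(N)          # hypothetical positive definite noise covariance

kappa_2 = np.linalg.cond(C_n, 2)             # lambda_max / lambda_min for this matrix
I0 = linear_fisher(grad_mu, C_n)

worst_ratio = 0.0
for _ in range(200):
    E = rng.normal(size=(N, N))
    dC = 1e-6 * (E + E.T)                    # small symmetric perturbation of C^n
    rel_dI = abs(linear_fisher(grad_mu, C_n + dC) - I0) / I0
    rel_dC = np.linalg.norm(dC, 2) / np.linalg.norm(C_n, 2)
    worst_ratio = max(worst_ratio, rel_dI / rel_dC)
```

Comparing `worst_ratio` with `kappa_2` shows how the observed sensitivity relates to the conditioning of $C^n$; making $C^n$ closer to singular (for example by shrinking the `0.5 * np.eye(N)` term) increases both.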

Relationship to previous work 

Much prior work has investigated the relationship between 
noise correlations and the fidelity of signal coding [3,5-11,13-16]. 
Two aspects of our current work complement and generalize those 
studies. 

The first is our set of results on the sign rule (Section "The sign rule 
revisited"). Here, we find that, if each cell pair has noise 
correlations that have the opposite sign vs. their signal correlations, 
the encoded information is always improved, and that, at least in 
the case of weak noise correlations, noise correlations that have the 
same sign as the signal correlations will diminish encoded 
information. This effect was observed by [6] for neural populations 
with identically tuned cells. Since the tuning was identical in their 
work, all signal correlations were positive. Thus, their observation 
that positive noise correlations diminish encoded information is 
consistent with the SR results described above. 

Relaxing the assumption of identical tuning, several studies 
followed [6] that used cell populations with tuning that differed 
from cell to cell, but maintained some homogeneous structure — 
i.e., identically shaped, and evenly spaced (along the stimulus axis) 
tuning curves, e.g., [5,7]. The models that were investigated then 
assumed that the noise correlation between each cell pair was a 
decaying function of the displacement between the cells' tuning 
curve peaks. The amplitude of the correlation function - which 
determines the maximal correlation over all cell pairs, attained for 
"nearby" cells - was the independent variable in the numerical 
experiments. Recall that these nearby (in tuning-curve space) cells, 
with overlapping tuning curves, will have positive signal correla- 
tions. These authors found that positive signs of noise correlations 
diminished encoded information, while negative noise correlations 
enhanced it. This is once again broadly consistent with the sign 
rule, at least for nearby cells which have the strongest correlation. 
Finally, we note that [5,10,12] give a crisp geometrical interpre- 
tation of the sign rule in the case of N = 2 cells. 

At the same time, experiments typically show noise correlations 
that are stronger for cell pairs with higher signal correlations 
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[3,6,19], which is certainly not in keeping with the sign rule. This 
underscores the need for new theoretical insights. To this effect, we 
demonstrated that, while noise correlations that obey the sign rule 
are guaranteed to improve encoded information relative to the 
independent case, this improvement can also occur for a diverse 
range of correlation structures that violate it. (Recall the 
asymmetry of our findings for the sign rule: noise correlations 
that violate the sign rule are only guaranteed to diminish encoded 
information if those noise correlations are very weak). 

This finding is anticipated by the work of [8,9,14], who used 
elegant analytical and numerical studies to reveal improvements in 
coding performance in cases where the sign rule was violated. 
They studied heterogeneous neural populations, with, for exam- 
ple, different maximal firing rates for different neurons. In 
particular, these authors show how heterogeneity can simulta- 
neously improve the accuracy and capacity of stimulus encoding 
[14], or can create coding subspaces that are nearly orthogonal to 
directions of noise covariance [8,9]. Taken together, these studies 
show that the same noise correlation structure discussed above - 
with nearby cells correlated - could lead to improved population 
coding, so long as the noise correlations are sufficiently strong. [8] 
also demonstrated that the magnitude of correlations needed to 
satisfy the "sufficiently strong" condition decreases as the 
population size increases, and that in the large N limit, certain 
coding properties become invariant to the structure of noise 
correlations. Overall, these findings agree with our observations 
about a large diversity of SR-violating noise correlation structures 
that improve encoded information. 

One final study requires its own discussion. Whereas the current 
study (and those discussed above) investigated how coding relates 
to noise correlations with no concerns for the biophysical origin of 
those correlations, [17] studied a semi-mechanistic model in which 
noise correlations were generated by inter-neuronal coupling. 
They observed that coupling that generates anti-SR correlations is 
beneficial for population coding when the noise level is very high, 
but that at low noise levels, the optimal population would follow 
the SR. Understanding why different mechanistic models can 
display different trends in their noise correlations is important, and 
we are currently investigating that issue. 

The geometry of the covariance matrix 

One geometrical, and intuitively helpful, way to think about 
problems involving noise correlations is to ask when the noise is 
"orthogonal to the signal": in these cases, the noise can be separated 
from or be orthogonal to the signal, and high coding performance is 
obtained. This geometrical view is equally valid for the cases we 
study (e.g., the conditions we derive in Theorem 3), and is implicit in 
the diagrams in Figure 1. To make the approach explicit, one could 
perform an eigenvector analysis on the covariance matrices at hand, 
where quantities like linear Fisher information are rewritten as a 
sum of projections of the tuning vector to the eigen-basis of the 
covariance matrix, weighted by the appropriate eigenvalues. 

This invites the question of whether a simpler way to obtain the 
results in our paper wouldn't be to consider how covariance 
eigenvectors and eigenvalues could be manipulated more directly. 
For example, if one could simply "rotate" the eigenvectors of the 
covariance matrix out of the signal direction — or shrink the 
eigenvalues in that direction — one would necessarily improve 
coding performance. So why don't we simply do this when 
exploring spaces of covariance matrices? The reason is that these 
eigenvalue and eigenvector manipulations are not as easy to enact 
as they might at first sound (to us, and possibly to the reader). 
Recall that we asked how noise correlations affect coding subject 
to the specific constraint that the noise variance of each neuron is 



fixed, which translates in general to rather complex constraints on 
the eigenvalues and eigenvectors. For example, the eigenvalues of 
a fixed-diagonal covariance matrix cannot be equivalently described 
by simply having a fixed sum (which is a necessary condition for 
the diagonals to be constant, but is not a sufficient one). These 
facts limit the insights that a direct approach to adjusting 
eigenvalues and eigenvectors can have for our problem, and 
emphasize the non-trivial nature of our results. 

An exception comes, for example, in special cases when the 
covariance matrix has a circulant structure, and consequently 
always has the Fourier basis for eigenvectors. These cases include 
many situations considered in the literature [8,10]. For contrast, 
the covariance matrices we studied were allowed to change freely, 
as long as the diagonals remained fixed. 

Limitations and extensions 

We have developed a rich picture of how correlated noise 
impacts population coding. For our results on noise cancellation in 
particular, this was done by allowing noise correlations to be 
chosen from the largest mathematically possible space (i.e., the 
entire spectrahedron). This describes the fundamental structure of 
the problem at hand, but are conclusions derived in this way 
important for biology? It is not hard to imagine many biological 
constraints that may further limit the range of possible noise 
correlations (e.g., limits on the strength of recurrent connections or 
shared inputs). On the one hand, the likelihood that the underlying 
phenomena could be found in biological systems seems increased 
by the fact that many different correlation matrices will suffice for 
noise free coding and that, as we discuss in Proposition 6, 
information levels appear to have some robustness under 
perturbations of the underlying correlation matrices. 

However, care must still be taken in interpreting what we mean 
by "noise free." As emphasized by, e.g., [8,27], noise upstream from 
the neural population in question can never be removed in 
subsequent processing. Therefore, the "noise free" bound we 
discuss in Lemma 8 should not allow for a higher information level 
than that determined by this upstream noise. In some cases, this fact 
could lead to a consistency requirement on either the set of signal 
correlations $C^\mu$, the set of allowed noise correlations $C^n$, or both. To 
specify these constraints and avoid possible over-interpretations of 
the abstract coding model that we study, one could combine an explicit 
mechanistic model with the present approach. 

On another note, we have asked what noise correlations allow 
for linear decoders to best recover the stimulus from the set of 
neural population responses. At the same time, there is reason to 
be wary of linear decoders [28] (see also [16]), as they might miss 
significant information that is only accessible via a non-linear read- 
out. Furthermore, given the non-linearity inherent in dendritic 
processing and spike generation [29], there is added motivation to 
consider information without assuming linearity. 

Furthermore, we have herein restricted ourselves to asking 
about pairwise noise correlations, while there are many studies 
that identify higher-order correlations (HOC) in neural data 
[30,31], and some numerical results [32] that hint at when those 
HOC are beneficial for coding. In light of this study, it is 
interesting to ask whether we can derive a similarly general theory 
for HOC, and to investigate how the optimal pairwise and higher- 
order correlations interrelate. Note this issue is closely related to 
the type of decoder that is assumed: the performance of a linear 
decoder (as measured by mean squared error) depends on the 
pairwise correlations, but not on HOC. Therefore the effect of HOC 
must be studied in the context of nonlinear coding. 

Finally, we note that here we used an abstract coding model that 
evaluates information based on the statistics $C^n$, $C^\mu$, $L$ and so on. 
For generality, we made no assumptions about the structure of these 
statistics, or about any links among them. This suggests two questions 
for future work: whether an arbitrary set of such statistics is 
realizable in a constructive model of random variables, and 
whether there are any typical relationships between these statistics 
when they arise from tuned neural populations. As a preliminary 
investigation, we partially confirmed the answer to the first 
question, except for a "zero measure" set of statistics, under 
generic assumptions (data not shown). 

Experimental implications 

Recall that we observed that, in general, for a given set of tuning 
curves and noise variances, there will be a diverse family of noise 
correlation matrices that will yield good (optimal, or near-optimal) 
performance. This effect can be observed in Figs. 2, 3, and 5, as 
well as our result about the dimension of the set of correlation 
matrices that yield (when it is possible) noise-free coding 
performance (Theorem 4). 

At least compared with the alternative of a unique optimal noise 
correlation structure, our findings imply that it could be relatively 
"easy" for the biological system to find good correlation matrices. At 
the same time, since the set of good solutions is so large, we should 
not be surprised to see heterogeneity in the correlation structures 
exhibited by biological systems. Similar observations have previously 
been made in the context of neural oscillators: Prinz and colleagues 
[33] observed that neuronal circuits with a variety of different 
parameter values could produce the types of rhythmic activity 
patterns displayed by the crab stomatogastric ganglion. Consequent- 
ly, there is much animal-to-animal variability in this circuit [34], 
even though the system's performance is strongly conserved. 

At the same time, the potential diversity of solutions could 
present a serious challenge for analyzing data (cf. [26]). Notice, 
at least for the N = 3 cases of Figs. 2 and 3, how 
much the performance can vary as one of the correlation 
coefficients is changed while the others are kept fixed. If this 
phenomenon is general, it means that, in an experiment where we 
observe a (possibly small) subset of the correlation coefficients, it 
may be very hard to know how those correlations actually affect 
coding: the answer to that question depends strongly on all of the 
other (unobserved) correlation coefficients. As our recording 
technologies improve [35], and we make more use of optical 
methods, these "gaps" in our datasets will get smaller, and this 
issue may be resolved; further theoretical work to gauge the 
seriousness of the underlying issue is also needed. In the 
meantime, caution seems wise when analyzing noise correlations 
in sparsely sampled data. 

Finally, recall that the optimal noise correlations will always lie 
on the boundary of the allowed region of such correlations. 
Importantly, what we mean by that boundary is flexible. It may be 
the mathematical requirement of positive semidefinite covariance 
matrices - the loosest possible requirement - or there may be 
tighter constraints that restrict the set of correlation coefficients. 
Since biophysical mechanisms determine noise correlations, we 
expect that there will be identifiable regions of correlation 
coefficients that are possible in a given circuit or system. Understanding 
those "allowed" regions will, we anticipate, be important 
for attempts to relate noise correlations to coding performance, 
and ultimately to help untangle the relationship between structure 
and function in sensory systems. 

Methods 

In the Methods below, we will first revisit the problem set-up, 
and define our metrics of coding quality. We will then prove the 
theorems from the main text. Finally, we will provide the details of 
our numerical examples. A summary of our most frequently used 
notation is listed in Table 1. 

Summary of the problem set-up 

We consider populations of neurons that encode a stimulus $s$ by 
their noisy responses $x_i$. For simplicity, we will suppress the vector 
notation in the Methods. Unless otherwise stated, most of our 
results apply equally well to either scalar or multi-dimensional 
stimuli. 

The mean activity or "tuning" of the neurons is described by 
$\mu_i(s) = E\{x_i \mid s\}$. In the case of scalar stimuli, this corresponds to the 
notion of a tuning curve. For more complex stimuli, this is more 
aligned with the idea of a receptive field. 

The trial-to-trial noise in $x_i$, given a fixed stimulus, can be 
described by the conditional covariance $C^n_{ij} = \mathrm{cov}(x_i, x_j \mid s)$ 
(superscript $n$ denotes "noise"). In particular, $C^n_{ii} = \mathrm{var}(x_i \mid s)$ are the noise 
variances of each neuron. 

We ask questions of the following type: given fixed tuning curves 
$\mu_i$ and noise variances $C^n_{ii}$, how does the choice of noise covariance 
structure $C^n_{ij}$, $i \neq j$, affect linear Fisher information $I_{F,\mathrm{lin}}$ (see Section 
"Defining the information quantities, signal and noise correlations")? 

Besides the local information measure $I_{F,\mathrm{lin}}$ that quantifies 
coding near a specific stimulus, we also considered global measures 
that describe overall coding of the entire ensemble of stimuli. 
These are $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$, described in Section "Defining the 
information quantities, signal and noise correlations". For these 
quantities, the relevant noise covariance is the stimulus-averaged quantity 
$E\{\mathrm{cov}(x_i, x_j \mid s)\}$. We overload the notation and write $C^n_{ij}$ for this 
averaged covariance in these global coding contexts. The optimization problem can 
then be identically stated for $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$. 

Defining the information quantities, signal and noise 
correlations 

Linear Fisher information. Linear Fisher information 
quantifies how accurately the stimulus near a value $s$ can be 
decoded by a local linear unbiased estimator, and is given by 

$$I_{F,\mathrm{lin}} = \nabla\mu^T (C^n)^{-1} \nabla\mu. \tag{5}$$

In the case of a $K$ dimensional stimulus the same definition holds, 
with 

$$\nabla\mu = \begin{pmatrix} \dfrac{\partial\mu_1}{\partial s_1} & \cdots & \dfrac{\partial\mu_1}{\partial s_K} \\ \vdots & & \vdots \\ \dfrac{\partial\mu_N}{\partial s_1} & \cdots & \dfrac{\partial\mu_N}{\partial s_K} \end{pmatrix}. \tag{6}$$

In order for $I_{F,\mathrm{lin}}$ to be defined by Eq. (5), we assume $C^n$ is 
invertible and hence positive definite: $C^n \succ 0$. It can be shown that 
$I_{F,\mathrm{lin}}^{-1}$ is the (attainable) lower bound of the covariance matrix of the 
error of any local linear unbiased estimator. Here the term lower 
bound is used in the sense of positive semidefiniteness, that is, the 
ordering $A \succeq B$ if and only if $A - B \succeq 0$. To obtain a scalar 
information quantity, we consider $\mathrm{tr}(I_{F,\mathrm{lin}})$ and also denote this by 
$I_{F,\mathrm{lin}}$ if not stated otherwise. 
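
As a minimal numerical sketch of Eq. (5), and not part of the original analysis, the following Python/NumPy snippet evaluates the linear Fisher information for a hypothetical two-neuron population encoding a scalar stimulus. The function name `linear_fisher_information` and the example slopes and covariances are illustrative assumptions.

```python
import numpy as np

def linear_fisher_information(grad_mu, noise_cov):
    """Return the K x K matrix grad_mu^T (C^n)^{-1} grad_mu (cf. Eq. (5)).

    grad_mu   : array of shape (N,) or (N, K), tuning-curve derivatives
    noise_cov : array of shape (N, N), assumed positive definite
    """
    grad_mu = np.asarray(grad_mu, dtype=float)
    if grad_mu.ndim == 1:
        grad_mu = grad_mu[:, None]          # scalar stimulus: treat as N x 1
    # Solve C^n X = grad_mu instead of forming the explicit inverse.
    return grad_mu.T @ np.linalg.solve(noise_cov, grad_mu)

# Hypothetical example: two neurons with identical tuning slopes.
grad_mu = np.array([1.0, 1.0])
independent = np.eye(2)
correlated = np.array([[1.0, -0.5],
                       [-0.5, 1.0]])        # same variances, negative correlation

# Scalar information is the trace of the K x K matrix.
info_independent = np.trace(linear_fisher_information(grad_mu, independent))
info_correlated = np.trace(linear_fisher_information(grad_mu, correlated))
```

In this toy case the two neurons have a positive signal correlation, and a negative noise correlation of -0.5 doubles the information relative to independent noise (4 versus 2).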

Optimal linear estimator. To quantify the global ability of 
the population to encode the stimulus (instead of locally, as for 
discrimination tasks involving small deviations from a particular 
stimulus value), we follow [18] and consider a linear estimator of 
the stimulus, given responses $x$: 

$$\hat{s} = \sum_i A_i x_i + s_0 = A^T x + s_0, \tag{7}$$

with fixed parameters $A_i$ and $s_0$ unchanged with $s$. The set of 
readout coefficients $A$ that minimize the mean square error for a 
scalar random stimulus $s$, i.e. 

$$E\{(\hat{s} - s)^2\}, \tag{8}$$

can be solved analytically as in [18], yielding: 

$$A = \Gamma^{-1} L, \qquad \min E\{(\hat{s} - s)^2\} = \mathrm{var}(s) - L^T \Gamma^{-1} L, \tag{9}$$

where 

$$(\Gamma)_{ij} = \mathrm{cov}(x_i, x_j) = \mathrm{cov}(\mu_i, \mu_j) + E\{\mathrm{cov}(x_i, x_j \mid s)\} := C^\mu_{ij} + C^n_{ij}, \tag{10}$$

and $L$ is a column vector with entries $L_i = \mathrm{cov}(x_i, s)$. Here the 
expectation $E\{\cdot\}$ generally means averaging over both noise and 
stimulus (except in $E\{\mathrm{cov}(x_i, x_j \mid s)\}$, where averaging is only over 
the stimulus). 

For multidimensional stimuli $s$, similar to the case for linear 
Fisher information, the lower bound (in the sense of positive 
semidefiniteness) of the error covariance $E\{(\hat{s} - s)(\hat{s} - s)^T\}$ is 
given by $\mathrm{cov}(s,s) - L^T \Gamma^{-1} L$. Here $L$ is extended to form a matrix 

$$L = \begin{pmatrix} \mathrm{cov}(x_1, s_1) & \cdots & \mathrm{cov}(x_1, s_K) \\ \vdots & & \vdots \\ \mathrm{cov}(x_N, s_1) & \cdots & \mathrm{cov}(x_N, s_K) \end{pmatrix}. \tag{11}$$

Furthermore, a corresponding lower bound for the sum of squared 
errors $E\{\|\hat{s} - s\|^2\}$ is the scalar version $\mathrm{tr}(\mathrm{cov}(s,s)) - \mathrm{tr}(L^T \Gamma^{-1} L)$. 

When minimizing the OLE error with respect to noise correlations, 
$C^\mu$, $\mathrm{cov}(s,s)$ and $L$ are constants with respect to the optimization. 
Minimizing OLE error is therefore equivalent to maximizing the 
second term above, given by $L^T \Gamma^{-1} L$. This motivates us to define 
what we call "the information for OLE", which is simply the second 
term (above), i.e., the term that is subtracted from the signal 
variance to yield the OLE error. Specifically, 

$$I_{\mathrm{OLE}} = L^T (C^\mu + C^n)^{-1} L, \quad \text{or the scalar version } \mathrm{tr}(L^T (C^\mu + C^n)^{-1} L). \tag{12}$$

Thus, when $I_{\mathrm{OLE}}$ is large, the decoding error is small, and vice versa. 
Comparing with the expression for $I_{F,\mathrm{lin}}$, we see a similar 
mathematical structure, which will enable almost identical proofs of 
our theorems for both of these measures of coding performance. 

Similar to $I_{F,\mathrm{lin}}$, we need $C^\mu + C^n$ to be invertible in order to 
calculate $I_{\mathrm{OLE}}$. Since the signal covariance matrix $C^\mu$ does not 
change as we vary $C^n$, this requirement is easy to satisfy. In 
particular, we assume $C^\mu$ is invertible ($C^\mu \succ 0$), and thus for all 
consistent - i.e. positive semidefinite - $C^n$, $C^\mu + C^n \succeq C^\mu \succ 0$, so 
that $C^\mu + C^n$ is invertible. 

Mutual information for Gaussian distributions. While 
the OLE and the linear Fisher information assume that a linear 
read-out of the population responses is used to estimate the 
stimulus, one may also be interested in how well the stimulus could 
be recovered by more sophisticated, nonlinear estimators. Mutual 
information, based on Shannon entropy, is a useful quantity of this 
sort. It has many desirable properties consistent with the intuitive 
notion of "information", and we will use it to quantify how well 
a non-linear estimator could recover the stimulus. 

Assuming that the joint distribution of $(x,s)$ is Gaussian ($s$ can be 
multidimensional), the mutual information has a simple expression 

$$\begin{aligned} I_{\mathrm{mut},G} &= \tfrac{1}{2} \log\det(\mathrm{cov}(s,s)) - \tfrac{1}{2} \log\det(\mathrm{cov}(s,s) - L^T \Gamma^{-1} L) \\ &= \tfrac{1}{2} \log\det(\mathrm{cov}(s,s)) - \tfrac{1}{2} \log\det(\mathrm{cov}(s,s) - L^T (C^\mu + C^n)^{-1} L). \end{aligned} \tag{13}$$

The quantities above are the same as in the definitions of $I_{\mathrm{OLE}}$. 
Moreover, log is taken to base $e$, and hence the information is in 
units of nats. To convert to bits, one must simply divide our $I_{\mathrm{mut},G}$ 
values by $\log(2)$. 

There is a consistency constraint that must be satisfied by any 
joint distribution of $(x,s)$, namely that 

$$\mathrm{cov}(s,s) - L^T (C^\mu + C^n)^{-1} L = \mathrm{cov}(s,s \mid x) \succeq 0. \tag{14}$$

This guarantees that $I_{\mathrm{mut},G}$ is always defined and real (but could be 
$+\infty$). To keep $I_{\mathrm{mut},G}$ finite, one needs to further assume 
$\mathrm{cov}(s,s) - L^T (C^\mu + C^n)^{-1} L \succ 0$, which is equivalent to $C^n \succ 0$. 
This can be seen by rewriting mutual information while 
exchanging the position of the two variables (since mutual 
information is symmetric), 

$$\begin{aligned} I_{\mathrm{mut},G} &= \tfrac{1}{2} \log\det(\mathrm{cov}(x,x)) - \tfrac{1}{2} \log\det(\mathrm{cov}(x,x) - L\,\mathrm{cov}(s,s)^{-1} L^T) \\ &= \tfrac{1}{2} \log\det(\mathrm{cov}(x,x)) - \tfrac{1}{2} \log\det(\mathrm{cov}(x,x \mid s)) \\ &= \tfrac{1}{2} \log \frac{\det(\mathrm{cov}(x,x))}{\det(C^n)}. \end{aligned}$$

It is easy to see that the formula contains terms similar to those 
in $I_{\mathrm{OLE}}$ and $I_{F,\mathrm{lin}}$. In the scalar stimulus case, since $\log(\cdot)$ is an 
increasing function, maximizing $I_{\mathrm{mut},G}$ is equivalent to maximizing 
$I_{\mathrm{OLE}}$. In fact, the leading term in the Taylor expansion of $I_{\mathrm{mut},G}$ 
with respect to $L^T (C^\mu + C^n)^{-1} L$ is 
$\dfrac{L^T (C^\mu + C^n)^{-1} L}{2\,\mathrm{var}(s)}$, which is 
proportional to $I_{\mathrm{OLE}}$. In the case of multivariate stimuli $s$, we note 
that the operation $\log\det(\cdot)$ preserves ordering defined in the 
positive semidefinite sense, i.e. $F \succeq G \Rightarrow \log\det(F) \geq \log\det(G)$. 
This close relationship suggests a way of transforming $I_{\mathrm{OLE}}$ to a 
comparable scale of information in nats (or bits) as 

$$\tfrac{1}{2} \log\det(\mathrm{cov}(s,s)) - \tfrac{1}{2} \log\det(\mathrm{cov}(s,s) - I_{\mathrm{OLE}}).$$
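
The relationships among Eqs. (12) and (13) and the symmetric form above can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a hypothetical linear-Gaussian toy model, $x = a s + \eta$ with a scalar stimulus, for which $C^\mu = \mathrm{var}(s)\, a a^T$ and $L = \mathrm{var}(s)\, a$. The function names and numerical values are not from the original work.

```python
import numpy as np

def ole_information(signal_cov, noise_cov, L):
    """K x K matrix L^T (C^mu + C^n)^{-1} L (cf. Eq. (12))."""
    L = np.asarray(L, dtype=float).reshape(len(signal_cov), -1)
    gamma = signal_cov + noise_cov
    return L.T @ np.linalg.solve(gamma, L)

def gaussian_mutual_information(stim_cov, signal_cov, noise_cov, L):
    """Gaussian mutual information in nats (cf. Eq. (13))."""
    stim_cov = np.atleast_2d(np.asarray(stim_cov, dtype=float))
    M = stim_cov - ole_information(signal_cov, noise_cov, L)
    _, logdet_s = np.linalg.slogdet(stim_cov)
    _, logdet_m = np.linalg.slogdet(M)
    return 0.5 * (logdet_s - logdet_m)

# Hypothetical toy model: x = a * s + noise, with scalar s of variance var_s.
a = np.array([1.0, 2.0])
var_s = 1.0
signal_cov = var_s * np.outer(a, a)   # C^mu (rank one here; only Gamma must be invertible)
noise_cov = np.eye(2)                 # C^n
L = var_s * a                         # cov(x_i, s)

i_ole = np.trace(ole_information(signal_cov, noise_cov, L))
i_mut = gaussian_mutual_information(var_s, signal_cov, noise_cov, L)

# Symmetric form: (1/2) log[ det(cov(x,x)) / det(C^n) ].
i_mut_symmetric = 0.5 * (np.linalg.slogdet(signal_cov + noise_cov)[1]
                         - np.linalg.slogdet(noise_cov)[1])

# I_OLE mapped to nats for a scalar stimulus, and I_mut converted to bits.
i_ole_in_nats = -0.5 * np.log(1.0 - i_ole / var_s)
i_mut_bits = i_mut / np.log(2)
```

For these values $I_{\mathrm{OLE}} = 5/6$ and $I_{\mathrm{mut},G} = \tfrac{1}{2}\log 6$ nats, and the two expressions for the mutual information agree.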

Signal and noise correlations. Given the noise covariance 
matrix $C^n$, one can normalize it as usual by its diagonal elements 
(variances) to obtain correlation coefficients 

$$\rho_{ij} = \frac{C^n_{ij}}{\sqrt{C^n_{ii} C^n_{jj}}}. \tag{15}$$

We next discuss signal correlations, which describe how similar 
the tuning of a pair of neurons is. For linear Fisher information, we 
define signal correlations as 

$$\rho^{sig}_{ij} = \frac{\nabla\mu_i \cdot \nabla\mu_j}{\|\nabla\mu_i\|_2 \, \|\nabla\mu_j\|_2}. \tag{16}$$

Here $\nabla\mu_i = \left(\dfrac{\partial\mu_i}{\partial s_1}, \ldots, \dfrac{\partial\mu_i}{\partial s_K}\right)$ is the sensitivity vector describing how 
the mean response of neuron $i$ changes with $s$. With the above 
normalization, $\rho^{sig}_{ij}$ takes values between $-1$ and $1$. 

For the other two information measures we use, $I_{\mathrm{OLE}}$ and 
$I_{\mathrm{mut},G}$, a similar signal correlation can be defined. Here, we first 
define analogous tuning sensitivity vectors $A^0_i$ for each neuron, 
which will replace $\nabla\mu_i$ in Eq. (16). These vectors are given by the rows of 

$$A^0 = (C^\mu + D^n)^{-1} L \quad \text{and} \quad A^0 = (C^\mu + D^n)^{-1} L M^{-1/2} \tag{17}$$

for $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$ respectively. Here $D^n$ is the diagonal matrix of 
noise variances, and $M = \mathrm{cov}(s,s) - L^T (C^\mu + C^n)^{-1} L$. 

The definitions of signal correlations above are chosen so that 
they are tied directly to the concept of the sign rule, as 
demonstrated in the proof of Theorem 1. As a consequence, for 
the case of $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$, signal correlations are defined through 
the population readout vector. This has an important implication 
that we note here. Consider a case where only a subset of the total 
population is "read out" to decode a stimulus. Then, the 
population readout vector, and hence the signal correlations 
defined above, could vary in magnitude and even possibly 
change signs depending on which neurons are included in the 
subset. 

A different definition of signal correlations for OLE is 
sometimes used in the literature, which we denote by 
$\tilde{\rho}^{sig}_{ij} = C^\mu_{ij} / \sqrt{C^\mu_{ii} C^\mu_{jj}}$. Naturally, one should not expect our sign 
rule results to apply exactly under this definition. However, when 
we redid our plots of signal vs. noise correlations using $\tilde{\rho}^{sig}_{ij}$ for our 
major numerical example (Fig. 5 ABC), we observed the same 
qualitative trend (data not shown). This reflects the fact that, at 
least in this specific example, the signal correlations defined in the 
two ways are positively correlated. Understanding how general this 
phenomenon is would require further studies taking into account 
how the relevant statistics ($C^\mu$, $L$, etc.) are generated from tuning 
curves or neuron models. 

We next define the notion of the magnitude or strength of 
correlations, as came up throughout the paper. In particular, in 
Section "Heterogeneously tuned neural populations", we considered 
restrictions on the magnitudes of noise correlations when 
finding their optimal values. We proceed as follows. Since $\rho_{ij} = \rho_{ji}$, 
the list of all pairwise correlations of the population can be 
regarded as a single point in $\mathbb{R}^{N(N-1)/2}$. If not stated otherwise, the 
vector 2-norm in that space (Euclidean norm) is what we call the 
"strength of correlations": 

$$\|\rho\|_2 = \sqrt{\sum_{i<j} \rho_{ij}^2}. \tag{18}$$
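
A short sketch of how Eqs. (15) and (18) might be evaluated in practice is given below. The helper names and the example covariance matrix are hypothetical and serve only to illustrate the normalization and the Euclidean norm over the $N(N-1)/2$ distinct pairs.

```python
import numpy as np

def noise_correlations(noise_cov):
    """Normalize a covariance matrix by its variances (cf. Eq. (15))."""
    sd = np.sqrt(np.diag(noise_cov))
    return noise_cov / np.outer(sd, sd)

def correlation_strength(rho):
    """Euclidean norm over the N(N-1)/2 distinct pairs i < j (cf. Eq. (18))."""
    upper = np.triu_indices(rho.shape[0], k=1)   # indices with i < j
    return np.linalg.norm(rho[upper])

# Hypothetical three-neuron noise covariance matrix.
noise_cov = np.array([[4.0, 1.0, 0.0],
                      [1.0, 1.0, -0.5],
                      [0.0, -0.5, 1.0]])

rho = noise_correlations(noise_cov)     # off-diagonals: 0.5, 0.0, -0.5
strength = correlation_strength(rho)    # sqrt(0.5)
```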



Proof of Theorem 1: The generality of the sign rule 

We will now restate and then prove Theorem 1, first for $I_{F,\mathrm{lin}}$ 
and then for $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$. 

Theorem 1. If for each pair of neurons, the signal and noise 
correlations have opposite signs, the linear Fisher information is greater than in the 
case of independent noise (trial-shuffled data). In the opposite situation where 
the signs are the same, the linear Fisher information is decreased compared to 
the independent case, in a regime of very weak correlations. Similar results hold 
for $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$, with a modified definition of signal correlations given in 
Section "Defining the information quantities, signal and noise 
correlations". 

The proof proceeds by showing that information increases along 
the direction indicated by the sign rule, and that the information 
quantities are convex, so that information is guaranteed to increase 
monotonically along that direction. 

Proof. Consider linear Fisher information 

$$I_{F,\mathrm{lin}} = \mathrm{tr}(\nabla\mu^T (C^n)^{-1} \nabla\mu). \tag{19}$$

Let $D^n$ be the diagonal part of $C^n$, corresponding to the (noise) 
variance of each neuron. We change the off-diagonal entries of 
$C^n$ along a certain direction $(C^n)'$ in $\mathbb{R}^{N(N-1)/2}$ and consider a 
parameterization of the resultant covariance matrix, with parameter 
$t$: $C^n(t) = D^n + (C^n)' t$. We evaluate the directional derivative 
$I'_{F,\mathrm{lin}} = \dfrac{d I_{F,\mathrm{lin}}}{dt}$ of $I_{F,\mathrm{lin}}$ at $C^n = D^n$, 

$$\begin{aligned} I'_{F,\mathrm{lin}} &= -\mathrm{tr}(\nabla\mu^T (D^n)^{-1} (C^n)' (D^n)^{-1} \nabla\mu) \\ &= -\mathrm{tr}((D^n)^{-1} (C^n)' (D^n)^{-1} (\nabla\mu \nabla\mu^T)) \\ &= -\sum_{i \neq j} \frac{(C^n_{ij})'}{D^n_{ii} D^n_{jj}} \, \nabla\mu_i \cdot \nabla\mu_j. \end{aligned} \tag{20}$$

Here $\nabla\mu_i = \left(\dfrac{\partial\mu_i}{\partial s_1}, \ldots, \dfrac{\partial\mu_i}{\partial s_K}\right)$, and we have used the identity 
$\mathrm{tr}(AB^T) = \sum_{i,j} A_{ij} B_{ij}$ and the fact $\dfrac{d}{dt} X^{-1} = -X^{-1} \dfrac{dX}{dt} X^{-1}$. 
Recalling the definition of signal correlations in Eq. (16), if the 
sign of $(C^n_{ij})'$ is chosen to be opposite to the sign of $\rho^{sig}_{ij}$ for all $i \neq j$, 
then Eq. (20) ensures that the directional derivative $I'_{F,\mathrm{lin}} > 0$ at 
$C^n = D^n$. 

We now derive a global consequence of this local derivative 
calculation. $I_{F,\mathrm{lin}}$ as a function of $t$ has $\dfrac{d I_{F,\mathrm{lin}}}{dt} > 0$ at $t = 0$. Since $I_{F,\mathrm{lin}}$ is 
smooth, there exists $\delta > 0$ such that for $t \in [0, \delta]$, $\dfrac{d I_{F,\mathrm{lin}}}{dt} > 0$. For the 
corresponding $C^n(t)$ with $t \in (0, \delta]$, applying the mean value theorem, we have 
$I_{F,\mathrm{lin}}(C^n(t)) - I_{F,\mathrm{lin}}(D^n) = t \left.\dfrac{d I_{F,\mathrm{lin}}}{dt}\right|_{t_1} > 0$ for some $t_1 \in [0, t]$. Similarly, for the 
opposite case where all the signs of the noise correlations are the 
same as the signs of $\rho^{sig}_{ij}$, the information will be smaller than in the 
independent case (at least for weak enough correlations). This 
proves the local "sign rule". 

Thus, at least for small noise correlations, choosing noise 
correlations that oppose signal correlations will always yield 
higher information values than the case of uncorrelated noise. To 
prove the "global" version of this theorem - that opponent signal 
and noise correlations always yield better coding than does 
independent noise - we will need to establish the convexity of 
$I_{F,\mathrm{lin}}$. This is done in Theorem 2. 

Note that, as we will soon prove, $I_{F,\mathrm{lin}}$ is a convex function of $t$, 
and hence $\dfrac{d I_{F,\mathrm{lin}}}{dt}$ is increasing with $t$. This means that the $\delta$ from 
our prior argument can be made arbitrarily large, and the same 
result - that performance improves when noise correlations are 
added, so long as they lie along this direction - will hold, provided 
that $C^n(\delta)$ is still physically realizable. Thus, the improvement 
over the independent case is guaranteed globally for any 
magnitude of noise correlations. 
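
The sign rule can be illustrated numerically for a scalar stimulus, where the signal correlation of a pair reduces to the sign of the product of the two tuning slopes. The following sketch is a hypothetical example (slopes, variances and the correlation magnitude `t` are assumptions chosen for illustration), comparing independent noise with noise correlations that oppose, or match, the signal correlations.

```python
import numpy as np

def fisher_info(grad_mu, noise_cov):
    """Scalar-stimulus linear Fisher information, grad_mu^T (C^n)^{-1} grad_mu."""
    return float(grad_mu @ np.linalg.solve(noise_cov, grad_mu))

# Hypothetical tuning slopes and fixed noise variances (the diagonal D^n).
grad_mu = np.array([1.0, 0.5, -2.0])
variances = np.array([1.0, 2.0, 0.5])
sd = np.sqrt(variances)

# For a scalar stimulus, Eq. (16) gives signal correlations of +1 or -1.
signal_sign = np.sign(np.outer(grad_mu, grad_mu))
np.fill_diagonal(signal_sign, 0.0)

def noise_cov_along(direction, t):
    """Covariance with fixed variances and correlation coefficients t * direction."""
    rho = np.eye(len(sd)) + t * direction
    return rho * np.outer(sd, sd)

t = 0.2   # small enough that both correlation matrices are positive definite
info_independent = fisher_info(grad_mu, np.diag(variances))
info_opposite = fisher_info(grad_mu, noise_cov_along(-signal_sign, t))
info_same = fisher_info(grad_mu, noise_cov_along(signal_sign, t))
```

In this example `info_opposite` exceeds `info_independent` (which equals 9.125), while `info_same` falls below it, as the theorem predicts for opposing and matching signs respectively.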

Note that the arguments above do not guarantee that the 
globally optimal noise correlation structure will follow the sign 
rule. Indeed, we have seen concrete examples of this in Figs. 2 and 
3. 

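Remark 1. From Eq. (20), the gradient (steepest uphill direction) 
of $I_{F,\mathrm{lin}}$ evaluated with independent noise $C^n = D^n$ is 

$$-2\,\frac{\nabla\mu_i \cdot \nabla\mu_j}{D^n_{ii} D^n_{jj}} = -2\,\frac{\|\nabla\mu_i\|_2 \, \|\nabla\mu_j\|_2}{D^n_{ii} D^n_{jj}} \, \rho^{sig}_{ij}.$$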

Remark 2. The same result can be shown for $I_{\mathrm{OLE}}$ and $I_{\mathrm{mut},G}$, 
replacing $\nabla\mu$ with $A^0 = (C^\mu + D^n)^{-1} L$ and 
$A^0 = (C^\mu + D^n)^{-1} L M^{-1/2}$, respectively, in the definition of $\rho^{sig}$ in Eq. (16). 
The gradients are $-2 A^0_i \cdot A^0_j$ and $-A^0_i \cdot A^0_j$, respectively, where $A^0_i$ 
is the $i$-th row of $A^0$, and $M = \mathrm{cov}(s,s) - L^T (C^\mu + C^n)^{-1} L$. 

Proof of Theorem 2: Optima lie on boundaries 

We begin by restating Theorem 2, which we then prove first for 
$I_{\mathrm{OLE}}$ and then for $I_{F,\mathrm{lin}}$ and $I_{\mathrm{mut},G}$. 

Theorem 2. The optimal $C^n$ that maximizes information must lie on the 
boundary of the region of correlations considered in the optimization. 

We will show that $I_{\mathrm{OLE}}$ is a convex function of $C^n$ and hence it 
will either attain its maximum value only on the boundary of the 
allowed region, or it will be uniformly constant. The latter is a 
trivial case that only happens when $L = 0$, as we see below. 

Proof. To show that a function is convex, it is sufficient to show that 
its second derivative along any linear direction is non-negative. For 
any constant direction $(C^n)' = B$ of changing (off-diagonal entries 
of) $C^n$, we consider a straight-line perturbation, $C^n(t) = C^n + tB$, 
parameterized by $t$. Taking the derivative of $I_{\mathrm{OLE}}$ with respect to $t$, 

$$I'_{\mathrm{OLE}} = -\mathrm{tr}(L^T (C^\mu + C^n(t))^{-1} B \, (C^\mu + C^n(t))^{-1} L). \tag{21}$$

We have used that $\dfrac{d}{dt} X^{-1} = -X^{-1} \dfrac{dX}{dt} X^{-1}$. Let 
$A = (C^\mu + C^n(t))^{-1} L$. Taking another derivative gives 

$$I''_{\mathrm{OLE}} = 2\,\mathrm{tr}(A^T B (C^\mu + C^n)^{-1} B A) \geq 0. \tag{22}$$

The inequality is because of Lemma 7 (see below) and 
$(C^\mu + C^n)^{-1}$ being positive definite. Also, note that $(BA)^T = A^T B$. 

For the case when $I_{\mathrm{OLE}}$ is constant over the region, using 
Proposition 10 (below), $BA = 0$ for any direction of change $B$. 
Letting $B_{ij} = \delta_{ip}\delta_{jq}$, $p \neq q$, we see that the $p,q$-th row of $A$ must be 
0. This leads to $A = 0$ and, since $A = \Gamma^{-1} L$, to $L = 0$. This was the 
claim in the beginning. In other words, in the case where $I_{\mathrm{OLE}}$ is 
constant with respect to the noise correlations, the optimal read-out 
is zero, regardless of the neurons' responses. With the 
exception of this (trivial) case, the optimal coding performance is 
obtained when the noise correlation matrix lies on a boundary of 
the allowed region. 

Lemma 7. (Linear algebra fact) For any positive semidefinite matrix $F$, 
and any matrix $G$, $G^T F G$ (assuming the dimensions match for matrix 
multiplications) is positive semidefinite and hence $\mathrm{tr}(G^T F G) \geq 0$. If "=" is 
attained, then $FG = 0$. 

Remark 3. When $F \succ 0$, i.e. positive definite, $\mathrm{tr}(G^T F G) = 0$ leads 
to $G = 0$ as $F$ is invertible. 

Proof. For any vector $z$ (with the same dimension as the number 
of columns in $G$), $z^T G^T F G z = (z^T G^T) F (G z) \geq 0$ since $F \succeq 0$. Thus, 
by definition, $G^T F G \succeq 0$, and therefore $\mathrm{tr}(G^T F G) \geq 0$. 

For the second part, if $\mathrm{tr}(G^T F G) = 0$, all the eigenvalues of 
$G^T F G$ must be 0 (since none of them can be negative as 
$G^T F G \succeq 0$), hence $G^T F G = 0$. This in fact requires $FG = 0$. To see 
this, let $U^T \Lambda U = F$ be an orthogonal diagonalization of $F$. For 
any vector $z$ as above, $z^T G^T F G z = 0$. Since the eigenvalues $\Lambda_{ii}$ are 
non-negative, let $\Lambda^{1/2}$ be the diagonal matrix with the square roots 
of $\Lambda_{ii}$. We have 

$$0 = z^T G^T U^T \Lambda U G z = (\Lambda^{1/2} U G z)^T (\Lambda^{1/2} U G z) = \|\Lambda^{1/2} U G z\|_2^2. \tag{23}$$

Therefore the vector $\Lambda^{1/2} U G z = 0$ and $F G z = U^T \Lambda^{1/2} \Lambda^{1/2} U G z = 0$. 
Since $z$ can be any vector, we must have $FG = 0$. 

Remark 4. Because of the similarities in the formulae for $I_{\mathrm{OLE}}$ 
and $I_{F,\mathrm{lin}}$, the same property can be shown for $I_{F,\mathrm{lin}}$. In order for 
$C^n$ to be invertible, $I_{F,\mathrm{lin}}$ is only defined over the open set of 
positive definite $C^n$. We therefore assume the closure of the 
allowed region is contained within this open set $C^n \succ 0$ to state the 
boundary result. 

A parallel version of Theorem 2 can also be established for 
$I_{\mathrm{mut},G}$, as we next show. 

Proof of Theorem 2 for $I_{\mathrm{mut},G}$. Again consider the linear 
parameterization $C^n(t)$ along a direction $B$, as defined above. 
Let $M = \mathrm{cov}(s,s) - L^T (C^\mu + C^n(t))^{-1} L$. The consistency constraint 
in Eq. (14) assures $M \succeq 0$. To keep $I_{\mathrm{mut},G}$ finite, we further 
assume $M \succ 0$. Then, the derivative of $I_{\mathrm{mut},G}$ with respect to $t$ is 

$$I'_{\mathrm{mut},G} = -\frac{1}{2} \frac{\det(M)\,\mathrm{tr}(M^{-1} M')}{\det(M)} = -\frac{1}{2}\,\mathrm{tr}(M^{-1} M'), \tag{24}$$

where we have used the identity $(\det(M))' = \det(M)\,\mathrm{tr}(M^{-1} M')$. 
The second derivative is thus 

$$\begin{aligned} I''_{\mathrm{mut},G} &= \tfrac{1}{2}\,\mathrm{tr}(M^{-1} M' M^{-1} M') - \tfrac{1}{2}\,\mathrm{tr}(M^{-1} M'') \\ &= \tfrac{1}{2}\,\mathrm{tr}(M^{-1/2} M' M^{-1/2} \cdot I \cdot M^{-1/2} M' M^{-1/2}) \\ &\quad + \mathrm{tr}(M^{-1/2} L^T \Gamma^{-1} B \Gamma^{-1} B \Gamma^{-1} L M^{-1/2}) \\ &\geq 0. \end{aligned}$$

Here $I$ is the identity matrix, $M' = L^T \Gamma^{-1} B \Gamma^{-1} L$, 
$M'' = -2 L^T \Gamma^{-1} B \Gamma^{-1} B \Gamma^{-1} L$ and $\Gamma = C^\mu + C^n$ as defined below Eq. (9). $M$ 
being positive definite allows us to split it into its square root, 
$M = M^{1/2} M^{1/2}$. Moreover, the identity $\mathrm{tr}(PQR) = \mathrm{tr}(QRP)$, for any 
matrices $P$, $Q$, and $R$, is used in deriving the last line in the above 
equation. For the last inequality, we apply Lemma 7 to the two 
terms with $I$ and $\Gamma^{-1}$ being positive semidefinite. 

We have thus shown that $I_{\mathrm{mut},G}$ is convex. For the special case 
that $I_{\mathrm{mut},G}$ is constant, Proposition 10 shows $B \Gamma^{-1} L = 0$. With the 
same argument as for $I_{\mathrm{OLE}}$, we observe that, in this (trivial) case 
$L = 0$. 
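
The convexity argument can be probed numerically along a single line $C^n(t) = C^n + tB$. The sketch below is an illustration under assumed, randomly generated statistics (the matrices `signal_cov`, `L` and the direction `B` are hypothetical); it checks that second differences of $I_{\mathrm{OLE}}$ on a uniform grid are non-negative and that the largest value on the segment occurs at an endpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4

# Hypothetical statistics: positive definite C^mu, a vector L, independent noise.
W = rng.standard_normal((N, N))
signal_cov = W @ W.T + np.eye(N)
L = rng.standard_normal(N)
noise_cov = np.eye(N)

# Symmetric direction B with zero diagonal, so the noise variances stay fixed.
R = rng.standard_normal((N, N))
B = R + R.T
np.fill_diagonal(B, 0.0)
B /= np.linalg.norm(B, 2)   # spectral norm 1: noise_cov + t*B is PD for |t| < 1

def ole_info(t):
    """I_OLE evaluated at C^n(t) = C^n + t B, scalar stimulus."""
    gamma = signal_cov + noise_cov + t * B
    return float(L @ np.linalg.solve(gamma, L))

ts = np.linspace(-0.9, 0.9, 19)
values = np.array([ole_info(t) for t in ts])

# Discrete analogue of the second derivative in Eq. (22).
second_differences = values[:-2] - 2.0 * values[1:-1] + values[2:]
is_convex_on_grid = bool(np.all(second_differences >= -1e-12))

# For a convex function, the maximum over the segment is at an endpoint.
max_at_endpoint = bool(values.max() <= max(values[0], values[-1]) + 1e-12)
```

A check of this kind covers only one direction and one set of statistics; the general statement rests on the proof above.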
Proof of Theorem 3: Conditions on the noise covariance 
matrix, under which noise-free coding is possible 

We begin by showing that, for a given set of tuning curves, the 
maximum possible information - which may or may not be 
attainable in the presence of noise - is that which would be 
achieved if there were no noise in the responses. This is the content 
of Lemma 8. Next, we will introduce Lemma 9, which is a useful 
linear-algebraic fact that we will use repeatedly in our proofs. 

We will then prove Theorem 3, which provides the conditions 
under which such noise-free performance can be obtained. One 
direction of the proof of Theorem 3 (sufficiency) is straightforward, 
while the other direction (necessity) relies on the observation of 
several conditions that are equivalent to the one in the theorem. 
We prove these equivalences in Proposition 10. 

For Theorem 3, we will only consider $I_{\mathrm{OLE}}$, since $I_{F,\mathrm{lin}}$ and 
$I_{\mathrm{mut},G}$ will typically be infinite in the noise-free case ($C^n$ becomes 
singular). If one takes all instances of infinite information as 
"equally optimal," a version of Theorem 3 can also be obtained; 
moreover, the condition in Theorem 3 becomes a sufficient but 
not necessary condition for infinite information. 

Lemma 8 (Upper bound by noise-free information). 

$$I_{\mathrm{OLE}}(C^n) \leq I_{\mathrm{OLE}}(0). \tag{25}$$

Here the noise-free information $I_{\mathrm{OLE}}(0)$ refers to that which is obtained when 
plugging in 0 in place of $C^n$ in Eq. (12). 

Proof. This follows essentially from the consistency between the 
information quantity and the positive semidefinite ordering of 
covariance matrices. First, we write 

$$I_{\mathrm{OLE}}(0) - I_{\mathrm{OLE}}(C^n) = \mathrm{tr}(L^T [(C^\mu)^{-1} - (C^\mu + C^n)^{-1}] L). \tag{26}$$

Then, we note the fact that for two positive definite matrices $F$, $G$, 
$F \succeq G$ if and only if $F^{-1} \preceq G^{-1}$. From this, we have 
$(C^\mu)^{-1} - (C^\mu + C^n)^{-1} \succeq 0$. Finally, applying Lemma 7 yields 
$I_{\mathrm{OLE}}(0) - I_{\mathrm{OLE}}(C^n) \geq 0$. 

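Lemma 9 (Useful linear algebra fact). For any $F$, $G$, and 
$M$ such that $F$ and $F + G$ are invertible, if $G F^{-1} M = 0$, then $(F + G)^{-1} M = F^{-1} M$. 

Proof. $(F + G) F^{-1} M = M + G F^{-1} M = M$. Multiplying both sides on the left by $(F + G)^{-1}$ gives the claim. 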
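
Lemma 9 is easy to confirm on a small example. In the sketch below (all matrices are hypothetical and randomly generated), $G$ is taken to be the orthogonal projector onto the complement of $v = F^{-1} M$, which is one simple way to satisfy $G F^{-1} M = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4

# Hypothetical positive definite F and an arbitrary vector M.
W = rng.standard_normal((N, N))
F = W @ W.T + np.eye(N)
M = rng.standard_normal(N)

v = np.linalg.solve(F, M)                     # v = F^{-1} M
G = np.eye(N) - np.outer(v, v) / (v @ v)      # projector with G v = 0

lhs = np.linalg.solve(F + G, M)               # (F + G)^{-1} M
lemma_holds = bool(np.allclose(lhs, v))
```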

Proposition 10. (Equivalent conditions used in proving the 
noise-free coding Theorem 3). 

Along a certain direction $(C^n)' = B$, the following conditions are equivalent. 

(i) $I''_{\mathrm{OLE}}(C^n) = 0$ 
(ii) $B (C^\mu + C^n)^{-1} L = 0$ 
(iii) $I_{\mathrm{OLE}}(C^n + tB) = I_{\mathrm{OLE}}(C^n)$. 

The same also holds for $I_{F,\mathrm{lin}}$ and $I_{\mathrm{mut},G}$. 

Proof for $I_{\mathrm{OLE}}$. "(i) $\Leftrightarrow$ (ii)": 
We again consider parametrized deviations from $C^n$, 
$C^n(t) = C^n + tB$ for some constant matrix $B$. Let 
$A_t = (C^\mu + C^n + tB)^{-1} L$, and recall (Eq. (22)), 

$$I''_{\mathrm{OLE}}(C^n) = 2\,\mathrm{tr}(A_0^T B (C^\mu + C^n)^{-1} B A_0). \tag{28}$$

Since $C^\mu + C^n$ is positive definite, according to the remark after 
Lemma 7, we have (i) $\Leftrightarrow$ (ii). 

"(ii) $\Rightarrow$ (iii)": If (ii), by Lemma 9, $A_t = A_0$. We have 
$I'_{\mathrm{OLE}}(C^n + tB) = -\mathrm{tr}(A_t^T B A_t) = -\mathrm{tr}(A_0^T B A_0) = 0$, for all $t$ in the 
allowed region, and hence (iii). 

"(iii) $\Rightarrow$ (i)": immediate. 
This concludes the proof for $I_{\mathrm{OLE}}$. 

Proof for $I_{F,\mathrm{lin}}$. For $I_{F,\mathrm{lin}}$, we further assume $C^n \succ 0$ to avoid 
infinite information. Identical arguments will prove the properties 
above, where (ii) is replaced by $B (C^n)^{-1} \nabla\mu = 0$. 

Proof for $I_{\mathrm{mut},G}$. For $I_{\mathrm{mut},G}$, we similarly assume $M \succ 0$ (as defined 
in the proof of Theorem 2). Let $A_t = (C^\mu + C^n + tB)^{-1} L$, then 
$M'|_{t=0} = A_0^T B A_0$, 

$$I''_{\mathrm{mut},G}(C^n) = \tfrac{1}{2}\,\mathrm{tr}(M^{-1} M' M^{-1} M') + \mathrm{tr}(M^{-1/2} A_0^T B \Gamma^{-1} B A_0 M^{-1/2}).$$

It is easy to see (ii) $\Rightarrow$ (i). When (i) holds, using Lemma 7, each of the 
two terms must be 0. In particular, as we discussed in the proof of 
Theorem 2 for $I_{\mathrm{mut},G}$ (above), each of the terms is non-negative. 
Thus, if their sum is 0, then each term must individually be 0. 
According to the remark after Lemma 7, the second term being 0 
indicates that $B A_0 M^{-1/2} = 0$ or $B A_0 = 0$, which is (ii). 

If (ii) holds, by Lemma 9, we have $A_t = A_0$. We have 

$$I'_{\mathrm{mut},G}(C^n + tB) = -\tfrac{1}{2}\,\mathrm{tr}(M^{-1} A_t^T B A_t) = -\tfrac{1}{2}\,\mathrm{tr}(M^{-1} A_0^T B A_0) = 0,$$

for all $t$ in the allowed region, and hence (iii). Similarly (iii) $\Rightarrow$ (i). 
This proves the property for $I_{\mathrm{mut},G}$. 

Theorem 3. A covariance matrix $C^n$ attains the noise-free bound for OLE 
information (and hence is optimal), if and only if $C^n A = C^n (C^\mu)^{-1} L = 0$. Here 
$L$ is the cross-covariance between the stimuli and the responses (Eq. (11)), $C^\mu$ is the 
covariance of the mean response (Eq. (10)), and $A$ is the linear readout vector for 
OLE, which is the same as in the noise-free case - that is, 
$A = (C^n + C^\mu)^{-1} L = (C^\mu)^{-1} L$ - when the condition is satisfied. 

Proof. If $C^n (C^\mu)^{-1} L = 0$, then Lemma 9 implies that 
$(C^\mu + C^n)^{-1} L = (C^\mu)^{-1} L$, which means that $I_{\mathrm{OLE}}(C^n) = I_{\mathrm{OLE}}(0)$, 
using the definition in Eq. (12). 

For the other direction of the theorem, consider a function of 
$t \in [0,1]$, $I_{\mathrm{OLE}}(tC^n) = \mathrm{tr}(L^T (C^\mu + tC^n)^{-1} L)$, whose values at the 
endpoints are equal, according to saturation of the information 
bound. The mean value theorem assures that there exists a $t_1 \in [0,1]$ 
such that 

$$I'_{\mathrm{OLE}}(t_1 C^n) = -\mathrm{tr}(L^T (C^\mu + t_1 C^n)^{-1} C^n (C^\mu + t_1 C^n)^{-1} L) = 0. \tag{29}$$

Since $C^n$ is positive semidefinite, according to Lemma 7, 
$C^n (C^\mu + t_1 C^n)^{-1} L = 0$. Now using Lemma 9, we have that 
$C^n (C^\mu)^{-1} L = 0$, and the readout vector $A = (C^\mu + C^n)^{-1} L = (C^\mu)^{-1} L$. 
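
The sufficiency direction of Theorem 3 can be illustrated with a small numerical sketch. The statistics below are hypothetical and randomly generated; the noise covariance is chosen, purely for illustration, as the projector onto the orthogonal complement of the noise-free readout $A$, which is one simple positive semidefinite matrix satisfying $C^n A = 0$.

```python
import numpy as np

def ole_info(signal_cov, noise_cov, L):
    """Scalar-stimulus I_OLE = L^T (C^mu + C^n)^{-1} L (cf. Eq. (12))."""
    return float(L @ np.linalg.solve(signal_cov + noise_cov, L))

rng = np.random.default_rng(2)
N = 5

# Hypothetical positive definite C^mu and cross-covariance vector L.
W = rng.standard_normal((N, N))
signal_cov = W @ W.T + np.eye(N)
L = rng.standard_normal(N)

A = np.linalg.solve(signal_cov, L)                   # noise-free readout (C^mu)^{-1} L

# Positive semidefinite noise covariance whose null space contains A.
noise_cov = np.eye(N) - np.outer(A, A) / (A @ A)

info_noise_free = ole_info(signal_cov, np.zeros((N, N)), L)
info_cancelling = ole_info(signal_cov, noise_cov, L)

# Same noise variances, but with all noise correlations removed.
info_independent = ole_info(signal_cov, np.diag(np.diag(noise_cov)), L)
```

Here `info_cancelling` matches `info_noise_free`, whereas independent noise with the same variances gives a strictly lower value, consistent with Lemma 8.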

Proof of Theorem 4: Conditions on tuning curves and 
variance, under which noise-free coding performance is 
possible 

Next, we will restate, and then prove, Theorem 4. The proof 
will require using geometric ideas in Lemma 11, which we will 
state and prove below. 

Theorem 4. For scalar stimulus, let $q_i = |A_i| \sqrt{C^n_{ii}}$, $i = 1, \ldots, N$, where 
$A = (C^\mu)^{-1} L$ is the readout vector for OLE in the noise-free case. Noise 
correlations may be chosen so that coding performance matches that which could 
be achieved in the absence of noise if and only if 

$$\max\{q_i\} \leq \frac{1}{2} \sum_{i=1}^{N} q_i. \tag{1}$$




When "<" is satisfied, all optimal correlations attaining the maximum form a 
$\frac{N(N-3)}{2}$ dimensional convex set on the boundary of the spectrahedron. When 
"=" is attained, the dimension of that set is $\frac{N_0(N_0+1)}{2}$, where $N_0$ is the 
number of zeros in $\{q_i\}$. 

The proof is based on the condition in Theorem 3. After taking 
several invertible transforms of the equation, the problem of 
finding a noise-canceling $C^n$ is transformed to that of finding a set 
of $N$ vectors, whose lengths are specified by $q_i$, that sum to zero (the 
vectors form a closed loop when connected consecutively). This 
allows us to take a geometrical point of view, in which inequality 
Eq. (1) becomes the triangle inequality. This will prove the 
"necessary" part of the Theorem. Lemma 11 shows the opposite 
direction, by inductively constructing the set of vectors that sum to 
zero. 

This procedure will yield one "particular" $C^n$ with the noise-canceling 
property. Very much like finding all general solutions of 
an ODE, we then add to our particular solution an arbitrary 
homogeneous solution, which belongs to a vector space of 
dimension $\frac{N(N-3)}{2}$. In order for our perturbed solution, at least 
for small enough perturbations, to still be positive semidefinite, the 
particular $C^n$ we start with must be generic. In other words, it 
must satisfy a rank condition, which is guaranteed by the 
construction in Lemma 11. We can then conclude that the set of 
all noise-canceling $C^n$ forms a linear segment with the dimension 
of the space of homogeneous solutions. 

Finally, special treatments are given for the cases of "=" in Eq. 
(1), as well as cases where some $q_i$'s are 0. 

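Proof. To establish the necessity direction of the Theorem, first let $D$ be a diagonal matrix with $D_{ii} = A_i$, so that $A = De$, where the vector $e = (1, \dots, 1)^T$. Note that

$$C^n A = 0 \;\Rightarrow\; D C^n D e = 0. \tag{30}$$

Let $D C^n D = \tilde{C}^n$, a positive semidefinite matrix with diagonal $\{q_i^2\}$.

$\tilde{C}^n$ can be diagonalized by an orthogonal matrix $U$, $\tilde{C}^n = U^T \Lambda U$. Without loss of generality, further assume that the first $k$ diagonal elements of $\Lambda$ are positive, with the rest being 0, where $k = \mathrm{rank}(\tilde{C}^n)$. Let $\tilde{\Lambda}$ be the first $k \times k$ block of $\Lambda$, and $\tilde{U}$ be the first $k$ rows of $U$. Then we have

$$U^T \Lambda U e = 0 \;\Rightarrow\; \tilde{\Lambda} \tilde{U} e = 0 \;\Rightarrow\; \tilde{\Lambda}^{1/2} \tilde{U} e = 0. \tag{31}$$

Let $B = \tilde{\Lambda}^{1/2} \tilde{U}$, a $k \times N$ matrix, and $B_i$ be the $i$-th column. As $\tilde{C}^n = B^T B$, the 2-norm of vector $B_i$ is $q_i$. Let $q_j$ be the maximum of $\{q_i\}$,

$$Be = 0 \;\Rightarrow\; \sum_i B_i = 0 \;\Rightarrow\; -B_j = \sum_{i \ne j} B_i \;\Rightarrow\; q_j = \|B_j\|_2 \le \sum_{i \ne j} \|B_i\|_2 = \sum_{i \ne j} q_i. \tag{32}$$

This concludes the necessary direction of our proof.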

To establish sufficiency, we first focus on the case of "<" and all $A_i \ne 0$. We will construct a generic $C^n$ that has rank $N-1$, satisfying $C^n A = 0$. We will basically reverse the direction of arguments in Eq. (30-32). We will later deal with the "=" case, and the case of $A_i = 0$ for some $i$.

Lemma 11. Let $e_i$, $i = 1, \dots, N-1$ be an orthonormal basis of $\mathbb{R}^{N-1}$. Given a set of positive $\{q_i\}_{i=1}^{N}$ satisfying "<" in Eq. (1), there exist $N$ vectors $\{B_i\}$, such that $\sum_i B_i = 0$, $\|B_i\|_2 = q_i$ and the spanned linear subspace $\mathrm{span}\{B_i\}_{i=1}^{N} = \mathrm{span}\{e_i\}_{i=1}^{N-1}$.



Proof. We prove this by induction. $N$ has to be at least 3 for the inequality to hold. For $N = 3$, this is the case of a triangle. There is a (unique) triangle $X_1 X_2 X_3$, for which the lengths of the three sides $X_1X_2$, $X_1X_3$, $X_2X_3$ are $q_3, q_1, q_2$ respectively. The altitude from $X_3$ intersects the line of $X_1X_2$ at $O$. Let $O$ be the origin of the coordinate system, with $X_1X_2$ being the x-axis and aligned with $e_2$, and the altitude $OX_3$ being the y-axis aligned with $e_1$. From such a picture, it is easy to verify the following: $B_3 = -(|X_1O| + p|X_2O|)e_2$, $B_1 = |X_1O|e_2 + |OX_3|e_1$, $B_2 = p|X_2O|e_2 - |OX_3|e_1$ satisfies the lemma, where $p = 1$ if $O$ lies within $X_1X_2$ and $p = -1$ otherwise.

For the case of $N \ge 4$, assume that $q_N$ is the largest of the $q$'s. Because of the inequality, there will always exist some non-negative real number $q$ (not necessarily one of the $q_i$'s) such that

$$\max\{q_N - q_{N-1}, q_1, \dots, q_{N-2}\} < q < \min\left\{\sum_{i=1}^{N-2} q_i,\; q_N + q_{N-1}\right\}. \tag{33}$$

We can verify that the set $\{q_1, \dots, q_{N-2}, q\}$ satisfies the inequality as well. By the assumption of induction, there exist vectors $\{B_1, \dots, B_{N-2}, B\}$ that span the space of $\{e_1, \dots, e_{N-2}\}$, such that $\|B_i\|_2 = q_i$ and $\|B\|_2 = q$.

Note the choice of $q$ also guarantees that $\{q_{N-1}, q_N, q\}$ can be the edge lengths of a triangle. Applying the result at $N = 3$, the three sides $X_1X_2$, $X_2X_3$, $X_1X_3$ correspond to $q, q_N, q_{N-1}$ respectively. Let

$$B_{N-1} = \frac{|X_1O|}{|X_1O| + p|X_2O|} B + |OX_3| e_{N-1}, \qquad B_N = p\,\frac{|X_2O|}{|X_1O| + p|X_2O|} B - |OX_3| e_{N-1}.$$

It is easy to verify that these $\{B_i\}_{i=1}^{N}$ satisfy the lemma.

Using the lemma, we have a set of $B_i$. Stacking them as column vectors gives a matrix $B$; moreover, $Be = 0$. Let $\tilde{C}^n = B^T B$, which is positive semidefinite with diagonals $\{q_i^2\}$. It is easy to show that $\mathrm{rank}(\tilde{C}^n) = \mathrm{rank}(B) = N - 1$, by comparing the null spaces of the matrices. Let $C^n = D^{-1} \tilde{C}^n D^{-1}$, where $D$ is defined as above. Then $C^n A = D^{-1} \tilde{C}^n D^{-1} A = D^{-1} \tilde{C}^n e = 0$.
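For $N = 3$ the construction can be carried out explicitly. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `noise_cancelling_cov_3` and the example values of the readout vector and variances are hypothetical. It builds three planar vectors with lengths $q_i = |A_i|\sqrt{C^n_{ii}}$ that close a triangle, forms their Gram matrix, and rescales by $D^{-1}$ on both sides.

```python
import math


def noise_cancelling_cov_3(A, var):
    """Build a 3x3 covariance C with diag(C) = var and C A = 0.

    Assumes all A_i are non-zero and that q_i = |A_i| * sqrt(var_i)
    satisfy the strict triangle inequality ("<" case of Eq. (1)).
    """
    q1, q2, q3 = (abs(a) * math.sqrt(v) for a, v in zip(A, var))
    # Place B3 along the negative x-axis; B1 and B2 close the triangle.
    x = (q1 ** 2 - q2 ** 2 + q3 ** 2) / (2.0 * q3)
    h = math.sqrt(q1 ** 2 - x ** 2)
    B = [(x, h), (q3 - x, -h), (-q3, 0.0)]      # B1 + B2 + B3 = 0
    gram = [[bi[0] * bj[0] + bi[1] * bj[1] for bj in B] for bi in B]
    # C = D^{-1} * Gram * D^{-1} with D = diag(A).
    return [[gram[i][j] / (A[i] * A[j]) for j in range(3)] for i in range(3)]


A = [1.0, -2.0, 1.0]            # hypothetical readout vector
var = [9.0, 4.0, 25.0]          # hypothetical noise variances (q = 3, 4, 5)
C = noise_cancelling_cov_3(A, var)
residual = [sum(C[i][j] * A[j] for j in range(3)) for i in range(3)]
print(residual)                 # entries should vanish up to rounding
```

The diagonal of the result reproduces the prescribed variances, so only the correlations have been chosen.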

Now consider the case where there are zeros in $A_i$. Assume that the first $k$ entries contain all of the non-zero values. We apply the construction above for the first $k$ dimensions, and get a $k \times k$ matrix $\hat{C}^n$ such that $\hat{C}^n \hat{A} = 0$, $\mathrm{rank}(\hat{C}^n) = k - 1$, where $\hat{A}$ is the part of $A$ with the first $k$ elements. The following block diagonal matrix

$$C^n = \begin{pmatrix} \hat{C}^n & 0 \\ 0 & D_n \end{pmatrix}, \quad \text{where } D_n = \mathrm{diag}\{C^n_{k+1,k+1}, \dots, C^n_{NN}\}, \tag{34}$$

satisfies $C^n A = 0$ and $\mathrm{rank}(C^n) = N - 1$.

We have shown that for the "<" case in the theorem, there is always a noise-canceling $C^n$. Consider the direction $(C^n)'$, in which off-diagonal elements of $C^n$ vary, while keeping $(C^n + (C^n)')A = 0$ (temporarily ignoring the positive semidefinite constraint). The set of all such $(C^n)'$ forms a linear subspace $M$ of $\mathbb{R}^{N(N-1)/2}$, determined by the linear system $(C^n)'A = 0$. Since there are $N$ equations, the dimension of $M$ is at least $\frac{N(N-1)}{2} - N = \frac{N(N-3)}{2}$.

In the "<" case, there must be at least 3 non-zero $A_i$'s in order for the triangle inequality to be satisfied in Eq. (1). We will choose these three $A_i$'s to be $A_1, A_2, A_N \ne 0$. Consider a block of the coefficient matrix associated with the system $(C^n)'A = 0$ (note that the entries of $(C^n)'$ are considered to be unknown variables), that are columns corresponding to variables $(C^n)'_{12}, \dots, (C^n)'_{1N}, (C^n)'_{2N}$:

$$\begin{pmatrix} A_2 & A_3 & \cdots & A_N & 0 \\ A_1 & 0 & \cdots & 0 & A_N \\ 0 & A_1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & A_1 & A_2 \end{pmatrix}. \tag{35}$$

Performing Gaussian elimination on the columns of this matrix, we obtain the following matrix, which will have the same rank:

$$\begin{pmatrix} A_2 & A_3 & \cdots & A_N & -\frac{2A_2A_N}{A_1} \\ A_1 & 0 & \cdots & 0 & 0 \\ 0 & A_1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & A_1 & 0 \end{pmatrix}. \tag{36}$$

This matrix, which determines the number of constraints that must be satisfied in order for $(C^n)'A = 0$, has rank $N$, and hence $\dim(M)$ is exactly $\frac{N(N-1)}{2} - N = \frac{N(N-3)}{2}$.

For any direction in $M$, we can always perturb the generic $C^n$ we found above by some finite amount $\epsilon$, and still have $C^n + \epsilon (C^n)'$ be positive semidefinite. Let $\lambda_{\min}$ be the smallest non-zero eigenvalue of $C^n$. Take any $|\epsilon| < \frac{\lambda_{\min}}{2\|(C^n)'\|_2}$. For any vector $z$, let $z = z_0 + z_1$ be an orthogonal decomposition where $z_0 = \frac{1}{\|A\|_2^2} A A^T z$ is the projection along the direction of $A$. Then

$$z^T (C^n + \epsilon (C^n)') z = z_1^T C^n z_1 + \epsilon z_1^T (C^n)' z_1 \ge \lambda_{\min} \|z_1\|_2^2 - |\epsilon| \|z_1\|_2^2 \|(C^n)'\|_2 \ge 0. \tag{37}$$

This shows that the $C^n + \epsilon (C^n)'$ are positive semidefinite and they form a set of the same dimension as $M$. We can always take the admissible $\epsilon$ values to their extremes, and the resulting matrices are all the possible noise-canceling matrices. For any $\bar{C}^n$ with $\bar{C}^n A = 0$, we have $(\bar{C}^n - C^n)A = 0$, and $\bar{C}^n - C^n$ must be in $M$. Note that the sets of positive semidefinite $C^n$ (spectrahedra) are convex. As a consequence, any point along the segment $C^n + t(\bar{C}^n - C^n)$, $t \in [0,1]$, will be positive semidefinite. This shows we must have encompassed $\bar{C}^n$ when considering the largest possible perturbations of $C^n$, in any direction $(C^n)' \in M$. Moreover, we note that the set of all noise-canceling $C^n$ is convex: if $C^n_i A = 0$, $i \in \{1,2\}$, then $(\lambda C^n_1 + (1-\lambda) C^n_2)A = 0$ for any $\lambda \in [0,1]$, and $\lambda C^n_1 + (1-\lambda) C^n_2$ is positive semidefinite, with the diagonal matching $C^n_{ii}$.

Thus, we have proved the claim about the dimension and convexity of the set of optimal correlations for the case of "<" in Eq. (1).

Finally, for the special case of "=" in Eq. (1), again first consider the case where all $A_i \ne 0$. As before, solving $C^n A = 0$ is equivalent to solving $\tilde{C}^n e = 0$, where $\tilde{C}^n = D C^n D$, and there is a one-to-one correspondence between the two. Revisiting Eq. (32) in the proof above, the equality condition in the triangle inequality implies that $\{B_i, i = 1, \dots, N-1\}$ all point along the same direction, and that $B_N$ is in the opposite direction, in order to cancel their sum. This fully determines $\tilde{C}^n = D_q C^0 D_q$, where $D_q = \mathrm{diag}\{q_1, \dots, q_N\}$, and

$$C^0 = \begin{pmatrix} 1 & \cdots & 1 & -1 \\ \vdots & \ddots & \vdots & \vdots \\ 1 & \cdots & 1 & -1 \\ -1 & \cdots & -1 & 1 \end{pmatrix}. \tag{38}$$

It is easy to verify that $\tilde{C}^n e = 0$, and hence there is a unique noise-canceling $C^n$.

For the case when there are $N_0$ 0's among the $\{A_i\}$, assume that the first $N - N_0$ coordinates are non-zero, so that $A = (\hat{A}, 0, \dots, 0)^T$. Next, we write $C^n A = 0$ in block matrix form, with blocks of dimension $N - N_0$ and $N_0$:

$$C^n A = \begin{pmatrix} \hat{C}^n & E^T \\ E & F \end{pmatrix} \begin{pmatrix} \hat{A} \\ 0 \end{pmatrix} = \begin{pmatrix} \hat{C}^n \hat{A} \\ E \hat{A} \end{pmatrix} = 0. \tag{39}$$

Applying the previous argument from the $A_i \ne 0$ case, there is a unique $\hat{C}^n$. Moreover, note that $\mathrm{rank}(\hat{C}^n) = 1$, following from the fact that $C^0$ in Eq. (38) has rank 1. Let $\hat{C}^n = U \Lambda U^T$ be the orthogonal diagonalization, with $\Lambda_{N-N_0, N-N_0} = \lambda \ne 0$ the only non-zero diagonal entry. Let $I_{N_0}$ be the identity matrix of dimension $N_0$. Then we can take an orthogonal transform:

$$\begin{pmatrix} U^T & 0 \\ 0 & I_{N_0} \end{pmatrix} \begin{pmatrix} \hat{C}^n & E^T \\ E & F \end{pmatrix} \begin{pmatrix} U & 0 \\ 0 & I_{N_0} \end{pmatrix} \begin{pmatrix} U^T & 0 \\ 0 & I_{N_0} \end{pmatrix} \begin{pmatrix} \hat{A} \\ 0 \end{pmatrix} = \begin{pmatrix} \Lambda & U^T E^T \\ EU & F \end{pmatrix} \begin{pmatrix} U^T \hat{A} \\ 0 \end{pmatrix}.$$

With the notation $U^T \hat{A} = A'$ and $EU = E'$, the original problem $C^n A = 0$ is therefore equivalent to finding all $E'$ and $F$ such that

$$\begin{pmatrix} \Lambda & E'^T \\ E' & F \end{pmatrix} \begin{pmatrix} A' \\ 0 \end{pmatrix} = 0, \tag{40}$$

while keeping the matrix in this equation positive semidefinite.

For any positive semidefinite matrix $X$, it is easy to show that $X_{ii} X_{jj} \ge X_{ij}^2$ by considering the principal minor with indices $i, j$, which must be non-negative. Note that since $\Lambda$ has only one non-zero diagonal entry, this forces the first $N - N_0 - 1$ columns of $E'$ to be entirely 0. So we can rewrite the block matrix by dimension $N - N_0 - 1$ and $N_0 + 1$ as

$$\begin{pmatrix} 0 & 0 \\ 0 & \begin{pmatrix} \lambda & \varepsilon^T \\ \varepsilon & F \end{pmatrix} \end{pmatrix}, \tag{41}$$

where $\varepsilon$ is the $(N - N_0)$-th column of $E'$. Since $\Lambda A' = 0$ reduces to $\lambda (A')_{N-N_0} = 0$, we have $(A')_{N-N_0} = 0$. It can be verified that, as long as the block structure of Eq. (41) is satisfied, Eq. (40) is always true. The positive semidefinite constraint becomes the constraint that the lower block be positive semidefinite; in turn, this corresponds to a spectrahedron (and hence a convex set) of dimension $\frac{N_0(N_0+1)}{2}$. Note that this dimensionality and convexity will be preserved when we undo the invertible linear transforms performed in prior steps to obtain the noise-canceling $C^n$'s.

Proof of Theorem 5: Probability that noise-free coding is 
possible 

In this subsection, we will restate, and then prove, Theorem 5. 
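Before the formal statement, the claim of this subsection can be made concrete with a small Monte Carlo experiment. This is a hedged sketch only: the exponential distribution for $X$, the trial count, and the function name `prob_condition_holds` are illustrative assumptions, since the theorem requires only i.i.d. draws on $[0, \infty)$ with finite, positive mean. The estimate is the fraction of random draws for which $\max\{q_i\} \le \frac{1}{2}\sum_i q_i$ holds.

```python
import random


def prob_condition_holds(n, trials=2000, seed=0):
    """Estimate P(max q_i <= 0.5 * sum q_i) for n i.i.d. draws.

    The q_i are drawn from an exponential distribution with mean 1,
    chosen here purely for illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        q = [rng.expovariate(1.0) for _ in range(n)]
        if max(q) <= 0.5 * sum(q):
            hits += 1
    return hits / trials


for n in (3, 5, 10, 20):
    print(n, prob_condition_holds(n))
```

Under these assumptions the estimated probability rises toward 1 as the population size grows, which is the behavior the theorem asserts in the limit.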

Theorem 5. If the $\{q_i\}$ defined in Theorem 4 are independent and identically distributed (i.i.d.) as a random variable $X$ on $[0, \infty)$ with $0 < E\{X\} < \infty$, then the probability

$$P(\text{the inequality of Eq. (1) is satisfied}) \to 1, \quad \text{as } N \to \infty. \tag{2}$$

Proof. We will use the following fact to establish a lower bound for the probability of the event in the theorem (below, we denote this event as $C$):

$$P(A \cap B) \ge P(A) + P(B) - 1. \tag{42}$$

We choose the two events $A$ and $B$ as $A = \left\{\sum_i q_i > \frac{N}{2} E\{X\}\right\}$ and $B = \left\{\max\{q_i\} < \frac{N}{4} E\{X\}\right\}$. Note that $A \cap B$ implies $C$,

$$\max\{q_i\} < \frac{N}{4} E\{X\} = \frac{1}{2} \cdot \frac{N}{2} E\{X\} < \frac{1}{2} \sum_{i=1}^{N} q_i, \tag{43}$$

the event in concern. We will then show that, for large populations, $P(A) \to 1$ and $P(B) \to 1$, and thus $P(C) \ge P(A \cap B) \to 1$.

For $A$, by the law of large numbers, the average should converge to the expectation (which is a positive number), hence

$$P(A) = P\left(\frac{1}{N} \sum_{i=1}^{N} q_i > \frac{1}{2} E\{X\}\right) \to 1, \quad \text{as } N \to \infty. \tag{44}$$

We next consider event $B$. Let the cumulative distribution function of $X$ be $F(x)$. Then the cumulative distribution function for $\max\{q_i\}$ is $F^N(x)$ by the assumption that these variables are drawn i.i.d. It follows that

$$P\left(\max\{q_i\} \ge \frac{N}{4} E\{X\}\right) = \int_{\frac{N}{4}E\{X\}}^{\infty} N F^{N-1}(x)\, dF(x) \le \int_{\frac{N}{4}E\{X\}}^{\infty} \frac{4x}{E\{X\}} F^{N-1}(x)\, dF(x) \le \frac{4}{E\{X\}} \int_{\frac{N}{4}E\{X\}}^{\infty} x\, dF(x).$$

Here, the first inequality is obtained via the lower bound of $x$ over the interval of integration, and the second uses the fact $F(x) \le 1$.

As $N \to \infty$, the last integral converges to 0 because of the fact that $E\{X\} < \infty$, together with the Lebesgue dominated convergence theorem. Hence $P\left(\max\{q_i\} < \frac{N}{4} E\{X\}\right) \to 1$ as $N \to \infty$.

Combining the limits of $A$ and $B$ using Eq. (42), together with the fact $C \supseteq A \cap B$, we conclude that $P(C)$ must approach 1 as $N \to \infty$.

Proof of Proposition 6: Sensitivity to perturbations

Here, we will prove Proposition 6, which puts bounds on the condition numbers that define the sensitivity of our coding metrics to perturbations in noise correlations or the tuning curves. For our proof, we will require three different lemmas. We state and prove these, before moving on to Proposition 6.

Here, we will first consider the condition number for the case of a scalar stimulus $s$, when $L$ is a vector. In the proof of the proposition, we show how to extend the results to the case of multivariate $s$. As we mentioned in Section "Sensitivity and robustness of the impact of correlations on encoded information", the same proof works for $I_{F,lin}$ as well as $I_{OLE}$.

Lemma 12. For any submultiplicative matrix norm $\|\cdot\|$ and $\|A\| \le 1/2$,

$$\|(I - A)^{-1}\| \le 2. \tag{45}$$

Proof. Since $\|A\| \le 1/2$, $(I - A)^{-1}$ exists and

$$\|(I - A)^{-1}\| = \left\|\sum_{n=0}^{\infty} A^n\right\| \le \sum_{n=0}^{\infty} \|A\|^n = (1 - \|A\|)^{-1} \le 2. \tag{46}$$

Lemma 13. For any positive definite matrix $A$, vectors $l$ and $a$ such that $\|a\|_2 \|A\|_2^{1/2} \|A^{-1}\|_2^{1/2} \le \|l\|_2$,

$$\frac{\left|(l + a)^T A^{-1} (l + a) - l^T A^{-1} l\right|}{\left|l^T A^{-1} l\right|} \le 3 \|A^{-1}\|_2 \|A\|_2 \frac{\|a\|_2}{\|l\|_2}. \tag{47}$$

Proof.

$$\begin{aligned}
\left|(l + a)^T A^{-1} (l + a) - l^T A^{-1} l\right| &\le 2\left|a^T A^{-1} l\right| + \left|a^T A^{-1} a\right| \\
&= 2\left|a^T A^{-1/2} A^{-1/2} l\right| + \left|a^T A^{-1} a\right| \\
&\le 2\|a\|_2 \|A^{-1/2}\|_2 \|A^{-1/2} l\|_2 + \|a\|_2^2 \|A^{-1}\|_2 \\
&\le 2\frac{\|a\|_2}{\|l\|_2} \|A^{-1/2}\|_2 \|A^{-1/2} l\|_2^2 \|A^{1/2}\|_2 + \left(\frac{\|a\|_2}{\|l\|_2} \|A^{1/2}\|_2\right)^2 \|A^{-1/2} l\|_2^2 \|A^{-1}\|_2 \\
&\le 3 \|A^{-1}\|_2 \|A\|_2 \frac{\|a\|_2}{\|l\|_2} \left|l^T A^{-1} l\right|.
\end{aligned}$$

Here, we have used $\|l\|_2 \le \|A^{1/2}\|_2 \|A^{-1/2} l\|_2$, $\|A^{\pm 1/2}\|_2^2 = \|A^{\pm 1}\|_2$, $\|A^{-1/2} l\|_2^2 = |l^T A^{-1} l|$, and the assumed condition in the last line.

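Lemma 14. For any positive definite matrix $A$, vector $l$ and matrix $B$ where $\|A^{-1}\|_2 \|B\|_2 \le 1/2$,

$$\frac{\left|l^T A^{-1} l - l^T (A + B)^{-1} l\right|}{\left|l^T A^{-1} l\right|} \le 2 \|A^{-1}\|_2 \|B\|_2. \tag{48}$$

Proof.

$$\begin{aligned}
\left|l^T A^{-1} l - l^T (A + B)^{-1} l\right| &= \left|l^T (A + B)^{-1} B A^{-1} l\right| \\
&= \left|l^T A^{-1/2} (I + A^{-1/2} B A^{-1/2})^{-1} A^{-1/2} B A^{-1/2} A^{-1/2} l\right| \\
&\le \|A^{-1/2} l\|_2 \|(I + A^{-1/2} B A^{-1/2})^{-1}\|_2 \|A^{-1/2}\|_2 \|B\|_2 \|A^{-1/2}\|_2 \|A^{-1/2} l\|_2 \\
&= (l^T A^{-1} l) \|A^{-1}\|_2 \|B\|_2 \|(I + A^{-1/2} B A^{-1/2})^{-1}\|_2 \\
&\le 2 (l^T A^{-1} l) \|A^{-1}\|_2 \|B\|_2.
\end{aligned}$$

Here we have used $\|A^{-1/2}\|_2^2 = \|A^{-1}\|_2$. As $\|A^{-1/2} B A^{-1/2}\|_2 \le \|A^{-1}\|_2 \|B\|_2 \le 1/2$, Lemma 12 is applied to obtain the last line.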
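The bound of Lemma 14 can be checked numerically on a small case. The following sketch works with symmetric $2 \times 2$ matrices so that inverses and 2-norms have closed forms; the helper names and the particular $A$, $B$, and $l$ are hypothetical values picked for illustration, and $B$ is assumed symmetric, as a perturbation of a covariance matrix would be.

```python
import math


def sym2_eigs(m):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mid = 0.5 * (a + d)
    rad = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mid - rad, mid + rad


def sym2_inv(m):
    """Inverse of a symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]


def quad(m, v):
    """Quadratic form v^T m v."""
    return sum(v[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


A = [[2.0, 0.5], [0.5, 1.0]]          # positive definite
B = [[0.1, -0.05], [-0.05, 0.05]]     # small symmetric perturbation
l = [1.0, 2.0]

norm_A_inv = 1.0 / sym2_eigs(A)[0]                   # ||A^{-1}||_2
norm_B = max(abs(e) for e in sym2_eigs(B))           # ||B||_2
A_plus_B = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
base = quad(sym2_inv(A), l)
rel_change = abs(base - quad(sym2_inv(A_plus_B), l)) / base
bound = 2.0 * norm_A_inv * norm_B
print(rel_change, bound)
```

In this example the relative change of the quadratic form is well below the bound, which is expected since the lemma gives a worst-case estimate.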

Proposition 6. The local condition number of $I_{F,lin}$ under perturbations of $C^n$ (where magnitude is quantified by 2-norm) is bounded by

$$\kappa_{F,lin:C^n} \le 2\kappa_2(C^n) := 2\|(C^n)^{-1}\|_2 \cdot \|C^n\|_2 = \frac{2\lambda_{\max}}{\lambda_{\min}}, \tag{3}$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalue of $C^n$ respectively. Here $\kappa_2$ is the condition number with respect to the 2-norm, as defined in the above equation.

Similarly, the condition number for perturbing $\nabla\mu$ is bounded by

$$\kappa_{F,lin:\nabla\mu} \le 3\sqrt{K}\, \kappa_2(C^n)\, \frac{\max_i \|(\nabla\mu)_{\cdot,i}\|_2}{\min_i \|(\nabla\mu)_{\cdot,i}\|_2}, \tag{4}$$

where $K$ is the dimension of the stimulus $s$.
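Proof. Note that

$$I_{F,lin} = \mathrm{tr}\left(\nabla\mu^T (C^n)^{-1} \nabla\mu\right) = \sum_{i=1}^{K} e_i^T \nabla\mu^T (C^n)^{-1} \nabla\mu\, e_i, \tag{49}$$

where $e_i = (0, \dots, 1, \dots, 0)^T$ is the $i$-th unit vector ($K \times 1$). Since the bound in Lemma 14 does not depend on $l$, we apply the Lemma for $l = \nabla\mu\, e_i = (\nabla\mu)_{\cdot,i}$, and each $i$ respectively. For any perturbation $B$ satisfying $\|(C^n)^{-1}\|_2 \|B\|_2 \le 1/2$, we have

$$\frac{\left|I_{F,lin}(C^n) - I_{F,lin}(C^n + B)\right|}{I_{F,lin}(C^n)} \le 2\kappa_2 \frac{\|B\|_2}{\|C^n\|_2}.$$

Here $\kappa_2 = \|(C^n)^{-1}\|_2 \cdot \|C^n\|_2$. We then note that for positive semidefinite matrices $\|C^n\|_2 = \lambda_{\max}$, $\|(C^n)^{-1}\|_2 = \lambda_{\min}^{-1}$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the smallest and largest eigenvalues of $C^n$. This proves the bound on the condition number for perturbing $C^n$.

Similarly, for a perturbation of $\nabla\mu$ with $v$, let $\|v\|_2 \|A\|_2^{1/2} \|A^{-1}\|_2^{1/2} \le \min_i \|\nabla\mu\, e_i\|_2$, where $A = C^n$. This guarantees that

$$\|v e_i\|_2 \|A\|_2^{1/2} \|A^{-1}\|_2^{1/2} \le \|v\|_2 \|A\|_2^{1/2} \|A^{-1}\|_2^{1/2} \le \|\nabla\mu\, e_i\|_2. \tag{50}$$

Applying Lemma 13 for each $v e_i$ and $\nabla\mu\, e_i$, we have

$$\begin{aligned}
\frac{\left|I_{F,lin}(\nabla\mu) - I_{F,lin}(\nabla\mu + v)\right|}{I_{F,lin}(\nabla\mu)} &\le \frac{1}{I_{F,lin}(\nabla\mu)} \sum_{i=1}^{K} 3\kappa_2 \frac{\|v e_i\|_2}{\|\nabla\mu\, e_i\|_2}\, e_i^T \nabla\mu^T (C^n)^{-1} \nabla\mu\, e_i \\
&\le 3\kappa_2 \frac{\|v\|_2}{\min_i \|\nabla\mu\, e_i\|_2} \\
&= 3\kappa_2 \frac{\|\nabla\mu\|_2}{\min_i \|\nabla\mu\, e_i\|_2} \frac{\|v\|_2}{\|\nabla\mu\|_2} \\
&\le 3\kappa_2 \frac{\|\nabla\mu\|_F}{\min_i \|\nabla\mu\, e_i\|_2} \frac{\|v\|_2}{\|\nabla\mu\|_2} \\
&\le 3\kappa_2 \frac{\sqrt{K} \max_i \|\nabla\mu\, e_i\|_2}{\min_i \|\nabla\mu\, e_i\|_2} \frac{\|v\|_2}{\|\nabla\mu\|_2}.
\end{aligned}$$

Here $\|\cdot\|_F$ is the Frobenius norm and we have used the fact that for any matrix $G$, $\|G\|_2 \le \|G\|_F$. The last inequality follows from the definition of $\|\cdot\|_F$.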

Details for numerical examples and simulation 

Here, we describe the parameters of our numerical models, and 
the numerical methods we used. 

Parameters for Fig. 1, Fig. 2 and Fig. 3. All parameters we 
use are dimensionless, unless stated otherwise. 

In Fig. 1, the mean response for the three neurons under stimulus 1 (red) and 2 (blue) is $\mu_1$ and $\mu_2$ respectively:

$$\mu_1 = \begin{pmatrix} 1.48 \\ 1.16 \\ 1.16 \end{pmatrix}, \qquad \mu_2 = \begin{pmatrix} 0.52 \\ 0.84 \\ 0.84 \end{pmatrix}. \tag{51}$$

For each case of correlation structure (i.e., for each row in Figure 1), the noise covariance matrix is the same for the two stimuli, and all neuron variances $C^n_{ii} = 1$. In detail:

$$C^n = \begin{pmatrix} 1 & 0.3381 & 0.3381 \\ 0.3381 & 1 & 0.1127 \\ 0.3381 & 0.1127 & 1 \end{pmatrix} \tag{52}$$
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(53) 



The confidence circles and spheres are calculated based on a 
Gaussian assumption for the response distributions. 
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In Fig. 2, the noise variances C" t are all set to 1 . Additionally 
'0.8 



C = 



0.8 



0.8 . 



0.0310' 
0.4012 . (54) 
0.0406 , 



In Fig. 3, the noise variances Cjj are all set to 1 . In panel A 
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(55) 



For panel B 



( l ) 






C= 1 




3 


v J 




W 



(56) 



Heterogeneous tuning curves. For the results in Section 
"Heterogeneously tuned neural populations", we use the same 
model and parameters as in [8] to set up a heterogeneous 
population with tuning curves of random amplitude and width. 
For completeness, we include the details of this setup as follows: 

The shape of each tuning curve (specifying firing rates) is modeled by a von Mises distribution. This is an analog of the Gaussian distribution over the unit circle:

$$r_i(\theta) = \alpha_i + \beta_i \exp\left(\gamma_i [\cos(\theta - \theta_i) - 1]\right). \tag{57}$$

The parameters $\beta_i$, $\gamma_i$, and $\theta_i$ respectively control the magnitude, width and preferred direction for each neuron. We set $\theta_i$ to be equally spaced along $[0, 2\pi]$ and $\alpha_i = 1$. The $\beta_i$ are independently chosen from a chi-square distribution with 3 degrees of freedom, scaled to a mean of 19. $\gamma_i$ is similarly drawn from a log-normal distribution with parameters giving mean 2 and standard deviation 2 (for the underlying normal distribution).

We assume Poisson firing variability, so that $(C^n)_{ii} = E\{\mathrm{var}(x_i|s)\} = E\{\mu_i\} = T\, E\{r_i\}$, and use a spike-count window $T = 100$ ms in Fig. 4 and 5.
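The population model described above can be sketched in a few lines. This is an illustrative reading of the setup, not the authors' code: the function names, the seed handling, the choice to space preferred directions as $2\pi i/N$ (excluding the duplicate endpoint), and the interpretation of the log-normal parameters as mean 2 and standard deviation 2 of the underlying normal are all assumptions.

```python
import math
import random


def tuning_curve(theta, alpha, beta, gamma, theta_pref):
    """Von Mises-shaped firing rate: alpha + beta*exp(gamma*(cos(d) - 1))."""
    return alpha + beta * math.exp(gamma * (math.cos(theta - theta_pref) - 1.0))


def sample_population(n, seed=0):
    """Draw (alpha, beta, gamma, theta_pref) for n heterogeneous neurons."""
    rng = random.Random(seed)
    neurons = []
    for i in range(n):
        # Chi-square with 3 dof as a sum of 3 squared standard normals
        # (mean 3), rescaled so that the mean amplitude is 19.
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
        beta = chi2 * 19.0 / 3.0
        # Assumed reading: underlying normal has mean 2 and std 2.
        gamma = rng.lognormvariate(2.0, 2.0)
        theta_pref = 2.0 * math.pi * i / n
        neurons.append((1.0, beta, gamma, theta_pref))
    return neurons


population = sample_population(20)
alpha, beta, gamma, pref = population[0]
print(tuning_curve(pref, alpha, beta, gamma, pref))   # peak rate: alpha + beta
```

Each curve peaks at $\alpha_i + \beta_i$ at the preferred direction and never falls below the baseline $\alpha_i$.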

Equivalence between penalty functions and constrained 
optimizations. In this section we note a standard fact about 
implementing constrained optimization with penalty functions — 
i.e., the method of Lagrange multipliers. 

Consider an optimization problem: $\max_x f(x)$. Now add a penalty term $p(x)$ with constant $\lambda_0$ and consider the new optimization problem: $\max_x f(x) - \lambda_0 p(x)$. If $x_0$ is one of the solutions to this new optimization problem, then it is also an optimal solution to the constrained optimization problem $\max_{x:\, p(x) = p(x_0)} f(x)$.

To show this, let $x_1$ be any point that satisfies $p(x_1) = p(x_0)$. Further, note $x_0$ is also the solution to the problem of $\max_x f(x) - \lambda_0 (p(x) - p(x_0))$, since we simply add a constant $\lambda_0 p(x_0)$. Therefore,

$$f(x_0) - \lambda_0 (p(x_0) - p(x_0)) \ge f(x_1) - \lambda_0 (p(x_1) - p(x_0)),$$
$$f(x_0) \ge f(x_1).$$

As $x_0$ also satisfies the constraint, we conclude that $x_0$ is an optimal solution to the constrained optimization problem.
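The argument can be illustrated on a one-dimensional toy problem. In this sketch the objective `f`, the penalty `p`, the multiplier `lam0`, and the search grid are all hypothetical choices made for illustration. The penalized maximizer is found by brute force and then compared against every grid point with the same penalty value.

```python
def f(x):
    """Hypothetical objective, maximized without penalty at x = 2."""
    return -(x - 2.0) ** 2


def p(x):
    """Hypothetical penalty (squared magnitude)."""
    return x * x


lam0 = 1.0
grid = [i / 100.0 for i in range(-300, 301)]

# Solve the penalized problem max_x f(x) - lam0 * p(x) on the grid.
x0 = max(grid, key=lambda x: f(x) - lam0 * p(x))

# Points satisfying the constraint p(x) = p(x0); here that is x0 and -x0.
feasible = [x for x in grid if abs(p(x) - p(x0)) < 1e-12]
best_feasible = max(feasible, key=f)
print(x0, feasible, best_feasible)
```

The penalized maximizer coincides with the best point on its own constraint set, as the argument above requires.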



We use this fact to find the information-maximizing noise correlations, with the restriction that the noise correlations be small in magnitude. For a given $\lambda_0$, we perform the optimization $\max_x f(x) - \lambda_0 p(x)$, where $f(\cdot)$ in this case is one of our information measures, $x$ refers to the off-diagonal elements of the covariance matrix, and $p(x)$ is the measure of the correlation strength as in Eq. (58). Thanks to the above result, we can be assured that the resulting covariance matrix (described by $x$) will be the one that maximizes the information for a particular strength of correlations. By varying $\lambda_0$ (or $r$ in Eq. (58)), we can thus parametrically explore how the optimal correlation structures change as one allows either larger, or smaller, correlations in the system.

Penalty function. In Section "Heterogeneously tuned neural 
populations", our aim is to plot optimized noise correlations at 
various levels of the correlation strength, as quantified by the 
Euclidean norm. This constrained optimization problem can be 
achieved, as shown in the previous section, by adding a term to the 
information that penalizes the Euclidean norm — that is, a 
constant times the sum-of-squares of correlations. This is precisely 
the procedure that we follow, ranging over a number of different 
values of the constant to produce the plot of Fig. 4. 

In more detail, we choose these different values of the constant as follows. To force the correlations towards a fixed strength of $r$, we optimize a modified objective function with an additional term:

$$I(C^n) - \frac{\|G * V^c\|_2}{2r} \sum_{i<j} \left(\frac{C^n_{ij}}{V^c_{ij}}\right)^2. \tag{58}$$

As will become clear, the term before the sum is a constant with respect to the terms being optimized; from one optimization to the next, we adjust the value of $r$ in this term. Here the variance terms $V^c_{ij} = \sqrt{C^n_{ii} C^n_{jj}}$ are constants to scale $C^n_{ij}$ properly as correlation coefficients. Also, $G$ is the gradient vector of $I(\cdot)$ at $C^n = D^n$ (the diagonal matrix corresponding to independent noise) with respect to off-diagonal entries of $C^n$ (see the remarks after the proof of Theorem 1). $G * V^c$ means the entry-wise product of the two vectors (of length $\frac{N(N-1)}{2}$, indexed by $(i,j)$, $i<j$). Note that $\|\cdot\|_2$ is the ordinary vector 2-norm.

To understand this choice of the constant in (58), note that the new optimal correlations with the penalty can be characterized by setting the gradient of the total objective function to 0. In a small neighborhood of $D^n$, the gradient of $I(\cdot)$ is close to $G$. With these substitutions, the equation for the gradient of the total objective function yields approximately:

$$G - \frac{\|G * V^c\|_2}{r} \left\{\frac{C^n_{ij}}{(V^c_{ij})^2}\right\}_{i<j} = 0, \quad \text{or,} \quad \frac{G * V^c}{\|G * V^c\|_2} = \frac{1}{r} \left\{\frac{C^n_{ij}}{V^c_{ij}}\right\}_{i<j}, \tag{59}$$

where we took an entry-wise product with $V^c$ and rearranged terms to obtain the final equality. The final equality implies that the (vector) 2-norm of noise correlations $\left\{\frac{C^n_{ij}}{V^c_{ij}}\right\}$ (i.e., the Euclidean norm) is approximately $r$. This is what we set out to achieve with the additional term in the objective function.
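The effect of this scaling can be verified directly in the linearized regime. The sketch below assumes the information is replaced by its first-order approximation around independent noise, so that its gradient equals $G$ everywhere; the function name and the numerical values of `G`, `V`, and `r` are hypothetical. Under that assumption the stationarity condition has a closed-form solution, and the resulting correlation vector has Euclidean norm exactly $r$.

```python
import math


def penalized_optimum(G, V, r):
    """Covariances C_ij maximizing G.C - (||G*V||_2 / (2r)) * sum (C_ij/V_ij)^2.

    Assumes a linearized objective, i.e. a constant gradient G.
    """
    k = math.sqrt(sum((g * v) ** 2 for g, v in zip(G, V)))   # ||G * V||_2
    # Stationarity: G_ij - (k / r) * C_ij / V_ij^2 = 0.
    return [r * g * v * v / k for g, v in zip(G, V)]


G = [0.4, -0.1, 0.25]     # hypothetical gradient entries, pairs (i, j), i < j
V = [1.0, 2.0, 0.5]       # hypothetical scale factors sqrt(C_ii * C_jj)
r = 0.1                   # target correlation strength

cov = penalized_optimum(G, V, r)
corr = [c / v for c, v in zip(cov, V)]
strength = math.sqrt(sum(c * c for c in corr))
print(corr, strength)
```

Outside the linearized regime the gradient is only approximately $G$, so the norm would match $r$ only approximately, as stated in the text.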

Rescaling signal correlation. In Fig. 5 DEF, we make 
scatter plots comparing noise correlations with the rescaled signal 
correlations. Here, we explain how and why this rescaling was done. 
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First, we note that the rescaling is done by multiplying each 
signal correlation by a positive weight. This will not change its 
sign, the property associated with the sign rule (Fig. 5 ABC). 

Next, recall that in deriving the sign rule (Eq. (20)), we 
calculated the gradient of the information with respect to noise 
correlations. One should expect alignment between this gradient 
and the optimal correlations when their magnitudes are small. In 
other words, if we make a scatter plot with dots whose y and x 
coordinates are entries of the gradient and noise correlation 
vectors, respectively (so that the number of dots is the length of 
these vectors), we expect to see that a straight line will pass through 
all the dots. 

We next note that the entries of the gradient vector $G * V^c$ are not exactly the normalized signal correlations (see Eq. (59)). Instead, this vector has additional "weight factors" that differ for each entry (neuron pair), and hence for each dot in the scatter plot. Thus, to reveal a linear relationship between signal and noise correlations in a scatter plot, we must scale each signal correlation with a proper (positive) weight, determined below, so that $\rho^{sig}_{ij} \to w_{ij} \rho^{sig}_{ij} = [\text{universal constant}] \cdot (G * V^c)_{ij}$. We then redo the scatter plots with these new values on the horizontal axis. As we will see, the weights $w_{ij}$ (defined below) do not depend on the noise correlations.
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